← Back to Submissions

Audit Report: Paper 23

---

Audit Summary

CODEBASE AUDIT RESULT: MEDIUM

AGENT REPRODUCIBILITY: False

---

Detailed Code Audit Report - Submission 23

1. COMPLETENESS & STRUCTURAL INTEGRITY

Critical Issues Found:

#### 1.1 Missing Import Reference in test.py (HIGH PRIORITY)

#### 1.2 train.py Uses Correct Import

#### 1.3 Missing Data Files

#### 1.4 Missing Model Checkpoint for Testing

Positive Completeness Indicators:

Complete Network Architecture: new_film.py contains fully implemented FSA and HPA modules matching paper description

Proper Loss Functions: utils/lossfun.py implements both intensity and gradient losses as described in paper

Training Loop: Complete with proper backpropagation, optimizer, scheduler, and logging

Evaluation Metrics: All 6 reported metrics (EN, SD, SF, AG, VIF, Q_AB/F) are implemented in utils/Evaluator.py

Data Processing Pipeline: data_process.py provides complete H5 conversion from images

No TODO comments or placeholder functions found

Structural Assessment:

The core implementation appears complete, but there are critical path discrepancies between training and testing scripts that prevent the test script from running.
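For reference, the intensity/gradient loss combination noted above follows a pattern common in infrared-visible fusion work. The sketch below is a generic NumPy illustration of that pattern (max-based targets, finite-difference gradients standing in for Sobel), not the submission's utils/lossfun.py:

```python
import numpy as np

def intensity_loss(fused: np.ndarray, ir: np.ndarray, vis: np.ndarray) -> float:
    # Common fusion formulation: pull the fused image toward the
    # pixel-wise maximum of the two source images (L1 distance).
    target = np.maximum(ir, vis)
    return float(np.abs(fused - target).mean())

def gradient_loss(fused: np.ndarray, ir: np.ndarray, vis: np.ndarray) -> float:
    # Finite-difference gradients stand in for the Sobel operator here;
    # the target is the element-wise max of the source gradient magnitudes.
    def grad(img: np.ndarray) -> np.ndarray:
        gy, gx = np.gradient(img.astype(np.float64))
        return np.abs(gx) + np.abs(gy)
    target = np.maximum(grad(ir), grad(vis))
    return float(np.abs(grad(fused) - target).mean())
```

In the PyTorch version these terms would be differentiable and summed with a weighting coefficient before backpropagation.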

---

2. RESULTS AUTHENTICITY RED FLAGS

No Evidence of Result Hardcoding ✅

Random Seed Management

Training Configuration Matches Paper Claims ✅

Note: The epoch and batch-size discrepancies involve configurable parameters, so the authors may have trained with different settings than the defaults.

---

3. IMPLEMENTATION-PAPER CONSISTENCY

Architecture Verification

#### 3.1 Frequency Strip Attention (FSA) - MATCHES ✅

#### 3.2 Hybrid Pooling Attention (HPA) - MATCHES ✅

#### 3.3 Text-Guided Cross-Attention - MATCHES ✅

#### 3.4 Hierarchical Architecture - MATCHES ✅

#### 3.5 Loss Function - MATCHES ✅

Hyperparameter Consistency

| Parameter | Paper | Code Default | Match? | Notes |
|-----------|-------|--------------|--------|-------|
| Learning Rate | 1e-4 | 1e-4 | ✅ | train.py:40 |
| Epochs | 300 | 100 | ⚠️ | Configurable via --numepochs |
| Batch Size | 16 | 2 | ⚠️ | Configurable via --batch_size |
| LR Decay Rate | 0.5 | 0.5 | ✅ | train.py:41 |
| LR Decay Step | 50 epochs | 50 epochs | ✅ | train.py:42 |
| Hidden Dim | 256 | 256 | ✅ | train.py:37 |
| Image2Text Dim | 32 | 32 | ✅ | train.py:37 |
| Model Params | 2.08M | Not verified | ? | Would need model instantiation |

Assessment: Core architectural hyperparameters match. Training hyperparameters are configurable, suggesting the authors overrode the defaults via command-line arguments rather than editing the code.
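If the paper's settings were supplied on the command line, the invocation would plausibly look like the following. Only `--numepochs` and `--batch_size` are flag names confirmed by the table above; any other flags are assumptions:

```shell
# Hypothetical reproduction command; not documented in the submission.
python train.py --numepochs 300 --batch_size 16
```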

---

4. CODE QUALITY SIGNALS

Positive Indicators:

Clean Implementation: No excessive commented-out code

Proper Documentation: Network modules have detailed docstrings explaining architecture

Modular Design: Clear separation between data loading, model, training, and evaluation

Standard Practices: Uses PyTorch conventions, proper optimizer/scheduler setup

Logging Infrastructure: Comprehensive tensorboard logging and checkpointing

Quality Issues:

⚠️ Import Inconsistency: Test script has wrong import path (critical)

⚠️ Checkpoint Name Mismatch: Test script looks for IVF.pth but checkpoint is ckpt_100.pth

⚠️ Batch Size Discrepancy: Code defaults suggest a local testing setup (batch_size=2) rather than the final paper setup

⚠️ Multi-GPU Configuration: Code sets CUDA_VISIBLE_DEVICES = "0, 1, 2, 3, 4, 5, 6, 7" and wraps the model in DataParallel, suggesting the authors had access to an 8-GPU machine

Dead Code Analysis:

Import Analysis:

All imports appear legitimate and standard.

---

5. FUNCTIONALITY INDICATORS

Data Loading - FUNCTIONAL ✅

Training Loop - FUNCTIONAL ✅

Evaluation - FUNCTIONAL ✅

Model Checkpoint - EXISTS ✅
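As context for the metric audit, EN is conventionally the Shannon entropy of the 8-bit grayscale histogram. A minimal sketch of that standard definition follows; this is illustrative and is not the submission's utils/Evaluator.py:

```python
import numpy as np

def entropy_EN(img: np.ndarray) -> float:
    # img: uint8 grayscale image.
    # EN = -sum(p * log2(p)) over the 256-bin intensity histogram.
    hist = np.bincount(img.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins to avoid log(0)
    return float(-(p * np.log2(p)).sum())
```

A constant image yields EN = 0; a two-value image with equal counts yields EN = 1 bit.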

---

6. DEPENDENCY & ENVIRONMENT ISSUES

Requirements Analysis (requirements.txt):

einops==0.4.1          ✅ Reasonable version

h5py==3.9.0 ✅ Standard data format library

matplotlib==3.7.2 ✅ Visualization

numpy==1.24.3 ✅ Numerical computing

opencv_python==4.5.3.56 ✅ Image processing

scikit_image==0.19.3 ✅ Image metrics

scikit_learn==1.3.0 ✅ ML utilities (for mutual info)

tensorboardX==2.6.2.2 ✅ Logging

tqdm==4.62.0 ✅ Progress bars

scipy==1.9.3 ✅ Scientific computing

torch==1.8.1+cu111 ⚠️ Specific CUDA version

torchvision==0.9.1+cu111 ⚠️ Matches torch version

Environment Issues:

⚠️ Outdated PyTorch Version: torch 1.8.1 is from March 2021 (3+ years old)

⚠️ CUDA Specificity: +cu111 indicates CUDA 11.1 requirement

No Conflicting Dependencies: Version combinations appear compatible

All Standard Packages: No proprietary or hard-to-install libraries

Computational Resource Requirements:

⚠️ 8 GPU Configuration: os.environ["CUDA_VISIBLE_DEVICES"] = "0, 1, 2, 3, 4, 5, 6, 7"

⚠️ Batch Size Implications: Default batch_size=2 suggests memory constraints
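A user without eight GPUs would need to override the hardcoded device list; a minimal sketch of the standard workaround (the variable must be set before the process initializes its CUDA context, i.e. before any torch CUDA call):

```python
import os

# Restrict the process to a single visible GPU; must run before the
# first CUDA initialization for torch to respect it.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```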

---

7. CONSISTENCY WITH PAPER CLAIMS

Quantitative Results - CANNOT VERIFY

The paper reports specific numerical results (e.g., EN: 6.73 on MSRS). Without the original datasets, the random seeds, and the exact training hyperparameters used, we cannot independently verify that this code produces the exact reported numbers. However:

Metric implementations are correct and would produce valid results

Model architecture matches paper description

Training procedure is complete and functional

Ablation Studies - SUPPORTED ✅

The paper reports ablation experiments (table at paper lines 89-95), and the code exposes configuration flags for toggling the FSA and HPA modules.

Assessment: Ablation experiments are reproducible from provided code.

---

8. CRITICAL BUGS PREVENTING EXECUTION

Bug #1: Test Script Import Error (CRITICAL)

test.py:13

from net.Film import Net # ❌ File does not exist

Should be:
from new_film import Net  # ✅ Correct location

Bug #2: Checkpoint Path Mismatch (HIGH)

test.py:26

ckpt_path = os.path.join("models", task_name+'.pth') # Looks for IVF.pth

Available:
models/ckpt_100.pth  # Actual checkpoint name
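One straightforward patch for Bug #2 would be to point the test script at the checkpoint that actually ships with the repository (shown here as an illustrative one-line change, not the authors' intended fix):

```python
import os

# Replace the task_name-derived path with the checkpoint actually provided.
ckpt_path = os.path.join("models", "ckpt_100.pth")
```

Alternatively, renaming models/ckpt_100.pth to models/IVF.pth would leave test.py untouched.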

Bug #3: Missing net/Film.py (CRITICAL)

The net/ package that test.py imports from does not exist in the repository; the network is implemented in new_film.py at the repository root. This is the same root cause as Bug #1.

---

9. EVIDENCE OF AI USAGE

Search for AI Documentation:

Code Style Analysis:

Code Quality Indicators:

Assessment: While no explicit AI usage log is found in the code, the documentation style and citation artifacts strongly suggest AI assistance was used, particularly for code documentation. However, the core implementation appears to be human-designed based on the paper's architecture.

---

10. REPRODUCIBILITY ASSESSMENT

What CAN Be Reproduced:

✅ Model architecture (FSA, HPA, cross-attention modules)

✅ Training procedure (with data preparation)

✅ Loss functions and optimization

✅ Ablation experiments (FSA/HPA toggling)

✅ Evaluation metrics computation

What CANNOT Be Reproduced Without Fixes:

❌ Testing with provided test.py (import error)

❌ Exact quantitative results (missing data, seed, exact hyperparameters)

❌ 300-epoch trained model (only 100-epoch checkpoint provided)

❌ Immediate execution (data must be prepared per FILM repository)

Required Fixes for Reproducibility:

  1. Fix test.py import: Change from net.Film import Net to from new_film import Net
  2. Fix checkpoint path: Update test.py to use ckpt_100.pth or rename checkpoint to IVF.pth
  3. Document data preparation: Provide clearer instructions or scripts for dataset setup
  4. Specify exact training command: Document the command-line arguments used for paper results
  5. Provide random seed: Specify any random seeds used for reproducibility
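Fix #5 could take the shape of a small seeding helper at the top of train.py. This sketch covers the Python and NumPy RNGs; the corresponding torch.manual_seed(seed) and torch.cuda.manual_seed_all(seed) calls would be added in the actual training script:

```python
import random
import numpy as np

def set_seed(seed: int = 42) -> None:
    # Pin the Python and NumPy RNGs for reproducible runs; in train.py
    # this would also seed torch (CPU and all CUDA devices).
    random.seed(seed)
    np.random.seed(seed)
```

With this in place, two runs seeded identically draw identical random numbers.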

---

11. OVERALL ASSESSMENT

Strengths:

  1. Complete model implementation matching paper architecture
  2. Proper training and evaluation infrastructure
  3. Well-documented code with clear module descriptions
  4. Comprehensive metric implementations (not shortcuts)
  5. Support for ablation studies through configuration flags
  6. Reasonable model checkpoint included

Critical Weaknesses:

  1. Test script cannot execute due to import error (critical bug)
  2. Checkpoint naming inconsistency prevents test script from loading model
  3. Missing data files (expected but no sample data provided)
  4. Hyperparameter discrepancies between defaults and paper claims
  5. No explicit reproducibility instructions for exact paper results

Red Flags:

⚠️ The test script was evidently not validated before submission (its import error fails immediately on execution)

⚠️ Training defaults don't match paper (100 vs 300 epochs, batch size 2 vs 16)

⚠️ AI documentation artifacts without explicit acknowledgment

Green Flags:

No hardcoded results - all metrics properly computed

Complete implementation - no TODO or placeholder functions

Proper model checkpoint - size and structure consistent with architecture

Extensive documentation - detailed explanations of modules

---

12. FINAL VERDICT

SEVERITY: MEDIUM

Justification:

This codebase represents a substantively complete implementation with the core architecture, training, and evaluation fully implemented. The model matches the paper description, metrics are properly computed (not hardcoded), and a trained checkpoint is provided.

However, critical bugs prevent immediate execution of the test script, and there are concerning inconsistencies between code defaults and paper claims. The test script's import error is particularly problematic as it suggests the submission was not fully validated.

The code is NOT ready for immediate use but IS salvageable with minor fixes. The underlying implementation appears sound, but the submission quality is diminished by:

  1. Incomplete validation/testing
  2. Missing documentation for exact reproduction
  3. Inconsistencies suggesting rushed submission preparation

This falls between HIGH (major gaps) and LOW (minor issues), landing at MEDIUM severity: functional core implementation with critical execution barriers that require user intervention to resolve.

---

AGENT REPRODUCIBILITY: False

Rationale:

No evidence of AI prompts, generation logs, or explicit documentation of AI tool usage was found in the code files. While code documentation shows patterns suggesting AI assistance (unusual citation format), this does not constitute documented agent reproducibility as defined by the evaluation criteria.

The Reproducibility Statement PDF exists but its contents could not be verified through code analysis alone. Without explicit prompt logs or AI usage documentation in accessible text format, the assessment is False.