---
Audit Summary
CODEBASE AUDIT RESULT: CRITICAL
AGENT REPRODUCIBILITY: False
---
Detailed Audit Analysis: Submission 40
Executive Summary
This submission presents a Feasibility-Guided Fair Adaptive Reinforcement Learning (FCAF-RL) framework for healthcare intervention optimization. The code exhibits CRITICAL red flags that indicate the implementation cannot reproduce the paper's claimed results. The codebase uses entirely synthetic data, contains major algorithmic simplifications that fundamentally contradict the paper's methodology, and implements a flawed evaluation strategy that makes the reported results unverifiable.
1. COMPLETENESS & STRUCTURAL INTEGRITY
1.1 Overall Structure
The codebase consists of three Python files plus one data file:
- `fcaf_rl.py` (266 lines) - main algorithm implementation
- `synthetic_data_generation.py` (143 lines) - data generation
- `run_experiment.py` (104 lines) - experiment execution
- `synthetic_data.csv` (36,964 rows) - pre-generated synthetic data
Status: Structurally complete but functionally inadequate.
1.2 Critical Missing Components
#### CRITICAL: No Real Data
- Paper Claims: "155,631 Medicaid beneficiaries across Washington, Virginia, and Ohio (January 2023–June 2025)"
- Code Reality: Uses entirely synthetic data generated by a simple random process
- Impact: All reported results in the paper are IMPOSSIBLE to verify from this codebase
#### CRITICAL: Off-Policy Evaluation Not Implemented
- Paper Claims: "Weighted doubly robust estimator combining importance sampling with Q-function modeling; also validated with fitted Q evaluation (FQE) and ordinary importance sampling"
- Code Reality (run_experiment.py, lines 24-54):

```python
def evaluate_policy(...):
    # We simulate policy recommendations and use the observed rewards as a proxy
    # for evaluation. In practice, a proper off-policy estimator should be used.
    ...
    for i in range(len(states)):
        action_idx = trainer.select_action(states[i], fairness_threshold)
        # treat reward as observed reward for that row
        rewards.append(1.0 - float(df_test.iloc[i]["acute_event"]))
```
- Problem: The evaluation uses the observed outcome regardless of which action the policy selects; the computed `action_idx` is never used. This is NOT off-policy evaluation - it simply recomputes the test set's baseline event rate.
- Impact: The evaluation metric is fundamentally broken and cannot measure policy performance
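For reference, even the simplest estimator the paper claims (ordinary importance sampling) must reweight observed rewards by the ratio of target- to behavior-policy action probabilities. A minimal single-step sketch, with hypothetical names and toy data not taken from the submission:

```python
import numpy as np

def ordinary_importance_sampling(rewards, behavior_probs, target_probs):
    """Value estimate for a target policy from logged one-step data.

    Each observed reward is reweighted by pi_target(a|s) / pi_behavior(a|s),
    so outcomes of actions the target policy favors count for more.
    """
    weights = target_probs / behavior_probs  # per-step importance ratios
    return float(np.mean(weights * rewards))

# Toy logged data: the behavior policy chose uniformly between 2 actions;
# the target policy deterministically prefers action 0.
actions = np.array([0, 1, 0, 1])
rewards = np.array([1.0, 0.0, 1.0, 0.0])
behavior_probs = np.full(4, 0.5)
target_probs = np.where(actions == 0, 1.0, 0.0)
print(ordinary_importance_sampling(rewards, behavior_probs, target_probs))  # 1.0
```

A doubly robust estimator, as claimed in the paper, would additionally combine these ratios with a fitted Q-model; none of this machinery is present in the submission.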
#### CRITICAL: Diffusion Model is a Misnomer
- Paper Claims: "Four-layer conditional U-Net with 64 hidden units per layer and linear noise schedule"
- Code Reality (fcaf_rl.py, lines 74-116): a 3-layer MLP autoencoder with no U-Net architecture, no noise schedule, and no diffusion process

```python
class DiffusionModel(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, state_dim + action_dim),
        )
```
- Impact: The data augmentation mechanism bears no resemblance to the described method
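For contrast, a model with the paper's claimed "linear noise schedule" would at minimum define a forward noising process. A minimal sketch under standard DDPM-style assumptions (this is illustrative, not the submission's code or the paper's exact architecture):

```python
import torch

# Linear noise schedule: beta_t rises linearly from beta_1 to beta_T.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative products alpha_bar_t

def forward_diffuse(x0, t):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    noise = torch.randn_like(x0)
    ab = alpha_bars[t]
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise, noise

x0 = torch.randn(8, 16)            # toy batch of 8 state-action vectors
xt, eps = forward_diffuse(x0, 500)  # noised sample at timestep 500
```

A denoiser (the paper's claimed conditional U-Net) would then be trained to predict `eps` from `(xt, t)`. The submitted `DiffusionModel` has no timestep input, no schedule, and no noise target, so it cannot implement this process.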
#### CRITICAL: Q-Network Architecture Mismatch
- Paper Claims: "Three-layer MLPs with 256 hidden units"
- Code Reality (fcaf_rl.py, lines 119-133): three layers, but with 128 hidden units by default

```python
def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
```
- Impact: Minor architectural deviation, but inconsistent with paper
1.3 Placeholder/Incomplete Implementations
#### No Target Network for Q-Learning
- Location: fcaf_rl.py, line 230
- Code:

```python
target_q = rewards_b  # single-step reward; no next-state bootstrap
```
- Problem: Q-learning requires bootstrapping from next states, but this implementation only uses immediate rewards (effectively 1-step returns with γ=0)
- Impact: Not actually implementing Q-learning as described
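For reference, a standard one-step Q-learning target bootstraps from the next state through a separate target network. A hypothetical sketch of what line 230 would need to compute (names assumed, not from the submission):

```python
import torch

def td_target(rewards, next_states, dones, target_q_net, gamma=0.99):
    """One-step Q-learning target: r + gamma * (1 - done) * max_a' Q_target(s', a')."""
    with torch.no_grad():
        next_q = target_q_net(next_states).max(dim=1).values
        return rewards + gamma * (1.0 - dones) * next_q

# Toy usage: a dummy target network over 4-dim states and 3 actions.
target_net = torch.nn.Linear(4, 3)
targets = td_target(torch.tensor([1.0, 0.0]), torch.randn(2, 4),
                    torch.tensor([1.0, 1.0]), target_net)
# With done = 1 the bootstrap term vanishes and the targets equal the rewards.
```

The submitted code computes only the first term (`rewards_b`), which reduces the method to one-step bandit regression rather than the conservative Bellman update the paper describes.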
#### Fairness Switching Logic is Non-Functional
- Location: fcaf_rl.py, lines 239-265
- Code:

```python
def select_action(self, state: np.ndarray, fairness_threshold: float) -> int:
    # initial realised disparity set to zero for demonstration
    current_disparity = 0.0
    # iterate through Q-networks in increasing fairness weight until disparity below threshold
    for q_net, lam in sorted(...):
        if current_disparity <= fairness_threshold:
            # compute Q-values for all actions
            ...
            return int(torch.argmax(q_vals).item())
```
- Problem: `current_disparity` is hardcoded to 0.0, so the condition `current_disparity <= fairness_threshold` is satisfied on the first iteration and the adaptive switching never happens. The first Q-network is always used regardless of realized fairness.
- Impact: The Adaptive Policy Switching (CAPS) component is completely non-functional
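A functional version of the switching rule would have to track realized disparity over the paper's 20-patient sliding window. A hypothetical sketch of such a monitor (the group encoding and disparity definition are assumptions, not taken from the paper):

```python
from collections import deque

class DisparityMonitor:
    """Track the outcome gap between two demographic groups over a sliding window."""

    def __init__(self, window: int = 20):
        self.buf = deque(maxlen=window)  # recent (group, outcome) pairs

    def update(self, group: int, outcome: float) -> float:
        """Record one patient outcome and return the current absolute disparity."""
        self.buf.append((group, outcome))
        g0 = [o for g, o in self.buf if g == 0]
        g1 = [o for g, o in self.buf if g == 1]
        if not g0 or not g1:
            return 0.0  # disparity undefined until both groups are observed
        return abs(sum(g0) / len(g0) - sum(g1) / len(g1))

mon = DisparityMonitor(window=20)
mon.update(0, 1.0)
print(mon.update(1, 0.0))  # 1.0 once both groups are observed
```

Feeding such a running estimate into `select_action` is the minimum needed for the switching loop to ever advance past the first Q-network; the submission contains nothing equivalent.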
#### Synthetic Data Generation is Trivial
- Location: synthetic_data_generation.py, lines 51-111
- Problem: The data generation uses simple random distributions with no relationship to real Medicaid data. The intervention assignment is nearly random, and the outcome generation is a simple linear function of comorbidities.
- Impact: Cannot represent the complexity of real healthcare trajectories
2. RESULTS AUTHENTICITY RED FLAGS
2.1 Results Cannot Be Generated From Code
CRITICAL FINDING: The paper reports specific numerical results (e.g., "8.5% acute event rate, 95% CI: 7.9–9.1%") that cannot possibly come from this codebase because:
- Wrong Data: The code uses synthetic data, not the claimed Medicaid dataset
- Broken Evaluation: The evaluation function doesn't measure policy performance - it just computes the test set's baseline event rate
- Non-Functional Components: Key algorithm components (adaptive switching, proper Q-learning) are not implemented
2.2 Evidence of Manual Result Insertion
The paper presents detailed comparative results across multiple baselines (IQL, FISOR, OGSRL, CAPS, FairDICE) with specific performance numbers. None of these baselines are implemented in the provided code, making it impossible to verify the comparisons.
2.3 Missing Ablation Study Code
The paper reports ablation studies (NOAUG, NOFAIR, NOSWITCH) with specific performance metrics. The code contains no mechanism to run these ablations or generate these results.
3. IMPLEMENTATION-PAPER CONSISTENCY
3.1 Major Algorithmic Inconsistencies
| Component | Paper Description | Code Implementation | Severity |
|-----------|------------------|---------------------|----------|
| Diffusion Model | 4-layer U-Net with 64 units, noise schedule | 3-layer MLP autoencoder, no diffusion | CRITICAL |
| Q-Learning | Conservative Bellman with bootstrapping | Single-step rewards, no bootstrap | CRITICAL |
| Off-Policy Eval | Doubly robust estimator, FQE, IS | Uses observed rewards regardless of policy | CRITICAL |
| Adaptive Switching | Monitors sliding window, switches policies | Hardcoded disparity=0, never switches | CRITICAL |
| Data Augmentation | Constrained sampling in feasible region | Unconstrained MLP reconstruction | HIGH |
3.2 Hyperparameter Mismatches
- Training Steps: Paper claims "100,000 gradient steps", code uses 10 epochs (likely ~1,000-2,000 steps)
- Q-Network Hidden Units: Paper says 256, code defaults to 128
- Sliding Window: Paper says 20 patients, code sets to 20 but never uses it
3.3 Dataset Inconsistency
- Paper: 155,631 real Medicaid beneficiaries
- Code: 36,964 synthetic rows (generated from 5,000 patients default)
- Impact: Completely different data, making all results unverifiable
4. CODE QUALITY SIGNALS
4.1 Positive Indicators
- Clean code structure with reasonable documentation
- No excessive commented-out code
- Imports are standard libraries (numpy, pandas, torch)
- Code runs without syntax errors (though functionality is broken)
4.2 Negative Indicators
#### Acknowledgment of Incompleteness
The code contains multiple comments acknowledging it's incomplete:
- Lines 22-24 (fcaf_rl.py): "Note: This implementation is simplified for demonstration purposes. In practice, the diffusion model and Q-networks would require more complex architectures..."
- Lines 11-13 (run_experiment.py): "To keep the example self-contained and computationally light, the evaluation here does not implement a full off-policy estimator..."
- Lines 27-28 (run_experiment.py): "In practice, a proper off-policy estimator should be used."
Impact: The authors acknowledge the code is a "demonstration" not a research implementation, yet present detailed research results as if generated from real experiments.
#### Incorrect Variable Usage
- Line 58 (fcaf_rl.py): `self.next_states = self.states` - assumes the next state equals the current state (a stationary environment)
- Line 213 (fcaf_rl.py): reuses training rewards for synthetic data instead of computing new rewards
- Line 251 (fcaf_rl.py): `current_disparity = 0.0` - hardcoded, never computed
#### Missing Feasibility Constraints
The paper emphasizes clinician-defined feasibility constraints:
- "Maximum one active intervention per week"
- "Prohibits simultaneous mental-health and substance-use support"
- "Prevents repeated identical interventions"
Code Reality: None of these constraints is implemented. Line 113 merely clamps states to the 0-80 range.
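For reference, constraints of this kind are typically enforced by masking infeasible actions before the argmax. A hypothetical sketch (the action indices and the exact constraint semantics are assumptions, not taken from the paper):

```python
import numpy as np

MENTAL_HEALTH, SUBSTANCE_USE = 2, 3  # hypothetical action indices

def feasible_mask(n_actions, active_this_week, last_action):
    """Boolean mask of actions allowed under the paper's stated constraints."""
    mask = np.ones(n_actions, dtype=bool)
    if active_this_week:          # at most one active intervention per week
        mask[:] = False
        mask[0] = True            # keep only a hypothetical "no intervention" action
        return mask
    if last_action is not None:
        mask[last_action] = False  # no repeated identical interventions
    if last_action == MENTAL_HEALTH:
        mask[SUBSTANCE_USE] = False  # no simultaneous mental-health + substance-use support
    return mask

q_values = np.array([0.2, 0.9, 0.4, 0.7])
mask = feasible_mask(4, active_this_week=False, last_action=1)
best = int(np.argmax(np.where(mask, q_values, -np.inf)))
print(best)  # 3: action 1 is masked out, so the next-best feasible action wins
```

Nothing resembling such a mask, or any other constraint-enforcement mechanism, appears anywhere in the submitted code.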
5. FUNCTIONALITY INDICATORS
5.1 Training Loop
Status: Partially functional but severely simplified
- Diffusion training reduces to autoencoder reconstruction loss
- Q-network training uses single-step rewards (not true Q-learning)
- Fairness penalty is computed but its interpretation is questionable
5.2 Evaluation Mechanism
Status: Fundamentally broken
- The `evaluate_policy()` function (run_experiment.py, lines 24-54) claims to evaluate the policy but actually ignores the policy's actions and just returns the baseline event rate
- Line 53: `rewards.append(1.0 - float(df_test.iloc[i]["acute_event"]))` uses the observed outcome regardless of the selected action
5.3 Data Loading
Status: Functional for synthetic data
- The `TransitionDataset` class properly loads and encodes the synthetic data
- However, no mechanism exists to load the claimed real Medicaid data
6. DEPENDENCY & ENVIRONMENT ISSUES
6.1 Dependencies
Status: Reasonable but incomplete
- Uses standard libraries: numpy, pandas, torch
- No requirements.txt or environment specification
- Missing version constraints could lead to compatibility issues
6.2 Computational Claims
- Paper: "~6 hours on single NVIDIA A100 GPU (12 GB memory)"
- Code: Default configuration would run in minutes on CPU
- Issue: Orders of magnitude difference in computational requirements
7. REPRODUCIBILITY ASSESSMENT
7.1 Can Results Be Reproduced?
NO - CRITICAL FAILURES:
- Different Data: Code uses synthetic data, paper claims real Medicaid data
- Broken Evaluation: Evaluation doesn't measure policy performance
- Missing Algorithms: Baseline algorithms (IQL, FISOR, etc.) not implemented
- Non-Functional Components: Adaptive switching doesn't work, Q-learning is simplified
- Missing Bootstrap/CI Computation: Paper reports confidence intervals and bootstrap samples (1,000 samples), but code has no bootstrap implementation
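For scale, the percentile bootstrap the paper implies (1,000 resamples over per-patient outcomes) takes only a few lines, yet nothing like it appears in the code. A hypothetical sketch on toy data:

```python
import numpy as np

def bootstrap_ci(outcomes, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a mean event rate."""
    rng = np.random.default_rng(seed)
    outcomes = np.asarray(outcomes)
    means = [rng.choice(outcomes, size=len(outcomes), replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)

# Toy binary outcomes with an ~8.5% event rate (illustrative only).
rates = np.random.default_rng(1).binomial(1, 0.085, size=5000)
lo, hi = bootstrap_ci(rates)
```

The paper's "95% CI: 7.9-9.1%" implies exactly this kind of computation was performed somewhere, but the submission contains no resampling, no CI calculation, and no statistical testing of any kind.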
7.2 What CAN Be Verified?
- The synthetic data generation runs and produces data with the correct schema
- The training loop executes without errors (though doing something different than claimed)
- The code structure is comprehensible
7.3 What CANNOT Be Verified?
- Any of the numerical results in the paper
- Comparisons to baseline methods
- Ablation study results
- Cross-state generalization performance
- Fairness metrics
- Statistical significance tests
- Computational efficiency claims
8. CRITICAL INCONSISTENCIES SUMMARY
8.1 Most Severe Issues
- Evaluation is Fundamentally Broken (CRITICAL)
- The evaluation function returns baseline metrics regardless of policy
- Makes all performance claims unverifiable
- Suggests results were obtained through different means
- Synthetic vs. Real Data (CRITICAL)
- Paper presents results from 155K real patients
- Code only has synthetic data generator
- No pathway to reproduce claimed results
- Algorithm Implementation Doesn't Match Description (CRITICAL)
- Diffusion model is actually a simple autoencoder
- Q-learning doesn't bootstrap from next states
- Adaptive switching is hardcoded to never switch
- Missing Off-Policy Evaluation (CRITICAL)
- Paper claims sophisticated OPE methods (doubly robust, FQE, IS)
- Code has none of these implemented
- Current evaluation is methodologically incorrect
8.2 Secondary Issues (HIGH Severity)
- No baseline implementations for comparisons
- Missing ablation study code
- Missing statistical testing and bootstrap methods
- Hyperparameter mismatches (training steps, hidden units)
- No feasibility constraint enforcement
9. OVERALL ASSESSMENT
This submission presents a CRITICAL case where:
- The paper describes sophisticated research on real healthcare data with specific quantitative results
- The code provides only a simplified "demonstration" using synthetic data
- Key algorithmic components are either missing or non-functional
- The evaluation methodology is fundamentally flawed and cannot measure what it claims to measure
- There is no possible pathway to reproduce the paper's results from the provided code
The code appears to be a minimal proof-of-concept or pedagogical example rather than research code. The acknowledgments within the code itself ("simplified for demonstration purposes", "proper off-policy estimator should be used") confirm this is not the code that generated the paper's results.
Recommendation: The paper's claims cannot be verified from this codebase. The results were either generated using different, unavailable code, or were produced through other means not represented in this submission.
10. AGENT REPRODUCIBILITY
Finding: No evidence of documented AI assistance in code generation.
- No prompt logs, conversation history, or AI attribution found in the directory
- No README or documentation mentioning AI-assisted development
- Code comments appear human-written without AI generation markers
- File timestamps suggest manual creation rather than bulk AI generation
Conclusion: AGENT REPRODUCIBILITY = False