Showcasing AI-driven scientific research and discoveries
48 papers accepted
Three best paper awardees presenting live on October 22, 2025
🎉 We thank Together AI for supporting the paper awards! 🏆
We introduce a simulation framework for studying how artificial intelligence agents behave in economic marketplaces. Unlike traditional computer simulations that use predetermined rules, our approach uses large language models (LLMs) as intelligent agents that can make strategic decisions and adapt their behavior over time.
This paper examines the impact of an August 2020 San Francisco policy that drastically lowered towing fees for low-income individuals. Leveraging a comprehensive dataset of towing incidents, we employ a difference-in-differences design to estimate the causal effect of the fee reduction on vehicle redemption rates.
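The difference-in-differences logic above can be sketched on synthetic data: compare the change in redemption rates for the treated group (low-income owners) against the change for everyone else, which nets out shared time trends under the parallel-trends assumption. The group means below are hypothetical illustrations, not the paper's actual San Francisco figures.

```python
import numpy as np

# Minimal difference-in-differences (DiD) sketch on synthetic data.
# All rates are made up for illustration.
rng = np.random.default_rng(0)

def mean_redemption(rate, n=20_000):
    """Simulate n binary redeem/forfeit outcomes and return the sample mean."""
    return rng.binomial(1, rate, n).mean()

# Treated = low-income vehicle owners; control = all other owners.
control_pre  = mean_redemption(0.40)  # before the Aug 2020 fee cut
control_post = mean_redemption(0.42)  # after (shared time trend only)
treated_pre  = mean_redemption(0.25)
treated_post = mean_redemption(0.37)  # time trend + fee-cut effect

# DiD: the treated group's change minus the control group's change.
did = (treated_post - treated_pre) - (control_post - control_pre)
print(f"Estimated effect of the fee reduction: {did:+.3f}")
```

With these hypothetical rates the estimate recovers roughly the 10-percentage-point effect built into the simulation.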
The rapid advancement of Large Language Models (LLMs) as both research assistants and peer reviewers creates a critical vulnerability: the potential for fully automated AI-only publication loops where AI-generated research is evaluated by AI reviewers. We investigate this adversarial dynamic by introducing BadScientist.
Selected for spotlight presentations at the conference
We present Echo, a multi-agent AI system that transforms patient narratives from Reddit into structured drug safety intelligence. Echo deploys four specialized language model agents in concert to discover drug-symptom associations.
We introduce a modular, multi-agent framework that autonomously navigates the early-stage drug discovery pipeline, from target identification to the generation of optimized hit candidates for Alzheimer's Disease.
PsySpace uses Large Language Models to simulate the emergent psychological dynamics of astronaut crews on long-duration space missions, demonstrating that interventions can significantly reduce crew stress.
This paper introduces a bond graph-based framework for thermodynamic consistency checking and self-correcting model reduction in stochastic biochemical kinetics, establishing an intrinsic, physically grounded correction trigger.
We investigate whether state-of-the-art vision-language models match humans' resilience in recognizing fragmented or overlaid text, revealing severe performance drops under perturbations across Chinese and English writing systems.
We demonstrate how large language models augmented with retrieval-based grounding can accelerate catalyst discovery for CO₂ reduction, achieving an 82% thermodynamic stability rate with a 200× gain in computational efficiency.
This study investigates the feasibility of constructing AI digital twins as advisors in strategic decision-making, revealing high fidelity on simple tasks but significant gaps in complex reasoning.
We introduce a fully automated pipeline to uncover novel scientific knowledge by transforming the latent representations of medical foundation models into sparse, human-interpretable concepts using Sparse Autoencoders.
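The core mechanism can be illustrated with a toy sparse autoencoder forward pass: dense foundation-model activations are projected into an overcomplete dictionary where a ReLU yields sparse, potentially interpretable concept activations. The weights below are random placeholders; in the paper's setting they would be learned with a reconstruction-plus-sparsity objective.

```python
import numpy as np

# Toy sparse autoencoder (SAE) forward pass -- weights are random,
# purely to show the shapes and the sparsity mechanism.
rng = np.random.default_rng(1)

d_model, d_dict = 64, 256               # activation dim, overcomplete dictionary
W_enc = rng.normal(0, 0.1, (d_model, d_dict))
b_enc = -0.5 * np.ones(d_dict)          # negative bias pushes codes toward zero
W_dec = rng.normal(0, 0.1, (d_dict, d_model))

x = rng.normal(size=(8, d_model))       # batch of model activations
z = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU produces sparse concept codes
x_hat = z @ W_dec                       # reconstruction from active concepts

sparsity = (z > 0).mean()
print(f"fraction of active concepts per example: {sparsity:.2%}")
```

Each row of `z` activates only a minority of the 256 dictionary entries, which is what makes the learned concepts amenable to human inspection.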
We propose a diverse inference approach that aggregates multiple models and methods at test time, increasing success rates on ARC puzzles to 93.75% with reasoning models, exceeding average human accuracy of 73.3-77.2%.
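One simple form of test-time aggregation is plurality voting over candidate answers from several solvers. The sketch below uses hypothetical solver outputs for a single ARC-style puzzle; it is not the paper's actual ensemble or answer format.

```python
from collections import Counter

# Illustrative test-time aggregation: plurality vote over candidate
# answers produced by multiple models/methods for the same puzzle.
def aggregate(candidates):
    """Return the most common candidate (ties broken by order first seen)."""
    return Counter(candidates).most_common(1)[0][0]

# Hypothetical serialized grid answers from five different solvers:
answers = [
    "[[1,0],[0,1]]",
    "[[1,0],[0,1]]",
    "[[1,1],[0,1]]",
    "[[1,0],[0,1]]",
    "[[0,0],[0,1]]",
]
print(aggregate(answers))  # → [[1,0],[0,1]]
```

Even this naive vote shows why diversity helps: independent solvers rarely agree on the same wrong answer, so the plurality answer tends to be the correct one.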
Testing five ChatGPT variants, we discovered a dramatic capability divide: the reasoning models achieved 44% and 20% success rates in de novo protein design, while all standard language models achieved 0% success.
Protected health information (PHI) de-identification is critical for enabling the safe reuse of clinical notes, yet evaluating and comparing PHI de-identification models typically depends on costly, small-scale expert annotations. We present TEAM-PHI, a multi-agent evaluation and selection framework that uses large language models (LLMs) to automatically measure de-identification quality and select the best-performing model without heavy reliance on gold labels.
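One way such label-free selection can work is to treat the consensus of several LLM evaluators as a pseudo-gold label and rank candidate models by agreement with it. The judges, verdicts, and model names below are all hypothetical placeholders, not TEAM-PHI's actual agents or data.

```python
from collections import Counter

# Hypothetical sketch of gold-label-free model selection: majority
# verdicts across LLM judges serve as pseudo-gold references.
def pseudo_gold(verdicts_per_judge):
    """Majority verdict per clinical note, across all judges."""
    return [Counter(v).most_common(1)[0][0] for v in zip(*verdicts_per_judge)]

# Each judge marks whether a model's de-identified note still leaks PHI.
judges = [
    ["clean", "leak", "clean", "clean"],
    ["clean", "leak", "leak",  "clean"],
    ["clean", "clean", "leak", "clean"],
]
gold = pseudo_gold(judges)

# Verdicts each candidate model received on the same four notes:
model_verdicts = {
    "model_A": ["clean", "leak", "leak", "clean"],
    "model_B": ["clean", "clean", "clean", "clean"],
}
best = max(model_verdicts,
           key=lambda m: sum(a == b for a, b in zip(model_verdicts[m], gold)))
print(best)
```

Because the judges' majority tends to be more reliable than any single judge, agreement with the consensus is a usable proxy for quality when expert annotations are unavailable.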
All accepted papers at the conference