Computation and Language 150
☆ Convergence and Divergence of Language Models under Different Random Seeds EMNLP 2025
In this paper, we investigate the convergence of language models (LMs)
trained under different random seeds, measuring convergence as the expected
per-token Kullback--Leibler (KL) divergence across seeds. By comparing LM
convergence as a function of model size and training checkpoint, we identify a
four-phase convergence pattern: (i) an initial uniform phase, (ii) a
sharp-convergence phase, (iii) a sharp-divergence phase, and (iv) a
slow-reconvergence phase. Further, we observe that larger models reconverge
faster in later training stages, while smaller models never actually
reconverge; these results suggest that a certain model size may be necessary to
learn stable distributions. Restricting our analysis to specific token
frequencies or part-of-speech (PoS) tags further reveals that convergence is
uneven across linguistic categories: frequent tokens and function words
converge faster and more reliably than their counterparts (infrequent tokens
and content words). Overall, our findings highlight factors that influence the
stability of the learned distributions in model training.
comment: Published at EMNLP 2025
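A minimal sketch of the convergence metric described above, the expected per-token KL divergence between the next-token distributions of two runs trained from different seeds; the checkpoint names are placeholders rather than the paper's released models.

```python
# Sketch: expected per-token KL divergence between two seed runs.
# Assumes two causal LMs with identical tokenizers; checkpoint names are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def per_token_kl(model_a, model_b, tokenizer, texts):
    kls = []
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logp_a = F.log_softmax(model_a(ids).logits, dim=-1)
            logp_b = F.log_softmax(model_b(ids).logits, dim=-1)
        # KL(P_a || P_b) at each position, averaged over all tokens
        kl = (logp_a.exp() * (logp_a - logp_b)).sum(-1)
        kls.append(kl.mean().item())
    return sum(kls) / len(kls)

tok = AutoTokenizer.from_pretrained("seed-run-0")          # placeholder checkpoint
m_a = AutoModelForCausalLM.from_pretrained("seed-run-0")   # placeholder checkpoint
m_b = AutoModelForCausalLM.from_pretrained("seed-run-1")   # placeholder checkpoint
print(per_token_kl(m_a, m_b, tok, ["The quick brown fox jumps over the lazy dog."]))
```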
☆ Scaling Spoken Language Models with Syllabic Speech Tokenization
Spoken language models (SLMs) typically discretize speech into
high-frame-rate tokens extracted from SSL speech models. As the most successful
LMs are based on the Transformer architecture, processing these long token
streams with self-attention is expensive, as attention scales quadratically
with sequence length. A recent SSL work introduces acoustic tokenization of
speech at the syllable level, which is more interpretable and potentially more
scalable, offering significant compression in token rate (4-5 Hz). Yet the
value of such tokens for spoken language modeling has not been fully explored.
We present the first systematic study of syllabic tokenization for spoken
language modeling, evaluating models on a suite of SLU benchmarks while varying
training data scale. Syllabic tokens can match or surpass previous high-frame-rate tokens
while significantly cutting training and inference costs, achieving more than a
2x reduction in training time and a 5x reduction in FLOPs. Our findings
highlight syllable-level language modeling as a promising path to efficient
long-context spoken language models.
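A back-of-the-envelope illustration of the quadratic attention cost argument above; the 50 Hz baseline token rate is an illustrative assumption (typical of SSL speech tokenizers), while the roughly 5 Hz syllabic rate follows the abstract. This isolates the attention term only and is not the paper's end-to-end cost measurement.

```python
# Rough self-attention cost comparison for a 60-second utterance.
# The 50 Hz high-frame-rate baseline is an illustrative assumption;
# the ~5 Hz syllabic rate follows the abstract. Only the quadratic
# attention term is counted here.
def attention_cost(token_rate_hz: float, seconds: float) -> float:
    n = token_rate_hz * seconds
    return n * n  # self-attention scales quadratically with sequence length

baseline = attention_cost(50, 60)   # 3000 tokens
syllabic = attention_cost(5, 60)    # 300 tokens
print(f"sequence length: {50 / 5:.0f}x shorter")
print(f"attention cost:  {baseline / syllabic:.0f}x cheaper")
```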
☆ Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models
Runze Liu, Jiakang Wang, Yuling Shi, Zhihui Xie, Chenxin An, Kaiyan Zhang, Jian Zhao, Xiaodong Gu, Lei Lin, Wenping Hu, Xiu Li, Fuzheng Zhang, Guorui Zhou, Kun Gai
Reinforcement Learning (RL) has shown remarkable success in enhancing the
reasoning capabilities of Large Language Models (LLMs). Process-Supervised RL
(PSRL) has emerged as a more effective paradigm compared to outcome-based RL.
However, existing PSRL approaches suffer from limited exploration efficiency,
both in terms of branching positions and sampling. In this paper, we introduce
a novel PSRL framework (AttnRL), which enables efficient exploration for
reasoning models. Motivated by preliminary observations that steps exhibiting
high attention scores correlate with reasoning behaviors, we propose to branch
from positions with high attention values. Furthermore, we develop an adaptive sampling
strategy that accounts for problem difficulty and historical batch size,
ensuring that the whole training batch maintains non-zero advantage values. To
further improve sampling efficiency, we design a one-step off-policy training
pipeline for PSRL. Extensive experiments on multiple challenging mathematical
reasoning benchmarks demonstrate that our method consistently outperforms prior
approaches in terms of performance as well as sampling and training efficiency.
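A small sketch of the branching idea described above: pick the reasoning steps with the highest aggregated attention scores as branch points for further rollouts. How steps are delimited and how attention is aggregated are assumptions, not the paper's exact recipe.

```python
# Sketch: choose branch positions for PSRL rollouts by attention score.
# Step segmentation and attention aggregation are assumptions here.
import numpy as np

def select_branch_steps(step_attention_scores, k=2):
    """Pick the k reasoning steps with highest aggregated attention."""
    scores = np.asarray(step_attention_scores)
    return np.argsort(scores)[::-1][:k].tolist()

# e.g. aggregated attention mass received by each reasoning step
scores = [0.05, 0.21, 0.08, 0.33, 0.12]
print(select_branch_steps(scores, k=2))  # -> [3, 1]
```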
☆ Searching for Difficult-to-Translate Test Examples at Scale
NLP models require test data that are sufficiently challenging. The
difficulty of an example is linked to the topic it originates from (its "seed
topic"). The relationship between the topic and the difficulty of its
instances is stochastic in nature: an example about a difficult topic can
happen to be easy, and vice versa. At the scale of the Internet, there are tens
of thousands of potential topics, and finding the most difficult one by drawing
and evaluating a large number of examples across all topics is computationally
infeasible. We formalize this task and treat it as a multi-armed bandit
problem. In this framework, each topic is an "arm," and pulling an arm (at a
cost) involves drawing a single example, evaluating it, and measuring its
difficulty. The goal is to efficiently identify the most difficult topics
within a fixed computational budget. We illustrate this bandit setup on the
task of finding difficult examples for machine translation. We find that
various bandit strategies vastly outperform baseline methods such as
brute-force search for the most challenging topics.
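One standard bandit strategy, UCB1, applied to the topic-as-arm formulation above; the difficulty oracle and topic list are hypothetical stand-ins for drawing and evaluating a single translation example per pull, and the abstract does not claim UCB1 specifically.

```python
# Sketch: UCB1 over topics ("arms"); each pull draws one example from the topic
# and measures its difficulty. The difficulty oracle below is a stand-in.
import math, random

def ucb1(topics, pull, budget, c=1.0):
    counts = {t: 0 for t in topics}
    means = {t: 0.0 for t in topics}
    for step in range(1, budget + 1):
        untried = [t for t in topics if counts[t] == 0]
        if untried:                      # play each arm once first
            t = untried[0]
        else:                            # then pick by upper confidence bound
            t = max(topics, key=lambda a: means[a] + c * math.sqrt(math.log(step) / counts[a]))
        d = pull(t)                      # draw one example about topic t, measure its difficulty
        counts[t] += 1
        means[t] += (d - means[t]) / counts[t]
    return max(topics, key=lambda a: means[a])

topics = ["law", "poetry", "chemistry"]                       # hypothetical topics
true_difficulty = {"law": 0.3, "poetry": 0.6, "chemistry": 0.4}
pull = lambda t: random.gauss(true_difficulty[t], 0.1)
print(ucb1(topics, pull, budget=300))
```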
☆ DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively
While previous AI Scientist systems can generate novel findings, they often
lack the focus to produce scientifically valuable contributions that address
pressing human-defined challenges. We introduce DeepScientist, a system
designed to overcome this by conducting goal-oriented, fully autonomous
scientific discovery over month-long timelines. It formalizes discovery as a
Bayesian Optimization problem, operationalized through a hierarchical
evaluation process consisting of "hypothesize, verify, and analyze". Leveraging
a cumulative Findings Memory, this loop intelligently balances the exploration
of novel hypotheses with exploitation, selectively promoting the most promising
findings to higher-fidelity levels of validation. Consuming over 20,000 GPU
hours, the system generated about 5,000 unique scientific ideas and
experimentally validated approximately 1,100 of them, ultimately surpassing
human-designed state-of-the-art (SOTA) methods on three frontier AI tasks by
183.7%, 1.9%, and 7.9%. This work provides the first large-scale evidence of
an AI achieving discoveries that progressively surpass human SOTA on scientific
tasks, producing valuable findings that genuinely push the frontier of
scientific discovery. To facilitate further research into this process, we will
open-source all experimental logs and system code at
https://github.com/ResearAI/DeepScientist/.
☆ MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages
Chenxi Whitehouse, Sebastian Ruder, Tony Lin, Oksana Kurylo, Haruka Takagi, Janice Lam, Nicolò Busetto, Denise Diaz
Ensuring native-like quality of large language model (LLM) responses across
many languages is challenging. To address this, we introduce MENLO, a framework
that operationalizes the evaluation of native-like response quality based on
audience design-inspired mechanisms. Using MENLO, we create a dataset of 6,423
human-annotated prompt-response preference pairs covering four quality
dimensions with high inter-annotator agreement in 47 language varieties. Our
evaluation reveals that zero-shot LLM judges benefit significantly from
pairwise evaluation and our structured annotation rubrics, yet they still
underperform human annotators on our dataset. We demonstrate substantial
improvements through fine-tuning with reinforcement learning, reward shaping,
and multi-task learning approaches. Additionally, we show that RL-trained
judges can serve as generative reward models to enhance LLMs' multilingual
proficiency, though discrepancies with human judgment remain. Our findings
suggest promising directions for scalable multilingual evaluation and
preference alignment. We release our dataset and evaluation framework to
support further research in multilingual LLM evaluation.
comment: 10 pages, 23 tables, 17 figures
☆ Deconstructing Self-Bias in LLM-generated Translation Benchmarks
As large language models (LLMs) begin to saturate existing benchmarks,
automated benchmark creation using LLMs (LLM as a benchmark) has emerged as a
scalable alternative to slow and costly human curation. While these generated
test sets have the potential to cheaply rank models, we demonstrate a critical
flaw: LLM-generated benchmarks systematically favor the model that created the
benchmark, exhibiting self-bias on low-resource-language-to-English translation
tasks. We show three key findings on automatic benchmarking of LLMs for
translation. First, this bias originates from two sources: the generated test
data (LLM as a testset) and the evaluation method (LLM as an evaluator), with
their combination amplifying the effect. Second, self-bias in LLM as a
benchmark is heavily influenced by the model's generation capabilities in the
source language; for instance, we observe more pronounced bias in into-English
translation, where the model's generation system is developed, than in
out-of-English translation tasks. Third, we observe that low diversity in the
source texts is one contributor to self-bias. Our results suggest that
improving the diversity of these generated source texts can mitigate some of
the observed self-bias.
☆ Clarification as Supervision: Reinforcement Learning for Vision-Language Interfaces
Recent text-only models demonstrate remarkable mathematical reasoning
capabilities. Extending these to visual domains requires vision-language models
to translate images into text descriptions. However, current models, trained to
produce captions for human readers, often omit the precise details that
reasoning systems require. This creates an interface mismatch: reasoners often
fail not due to reasoning limitations but because they lack access to critical
visual information. We propose Adaptive-Clarification Reinforcement Learning
(AC-RL), which teaches vision models what information reasoners need through
interaction. Our key insight is that clarification requests during training
reveal information gaps; by penalizing success that requires clarification, we
create pressure for comprehensive initial captions that enable the reasoner to
solve the problem in a single pass. AC-RL improves average accuracy by 4.4
points over pretrained baselines across seven visual mathematical reasoning
benchmarks, and analysis shows it would cut clarification requests by up to 39%
if those were allowed. By treating clarification as a form of implicit
supervision, AC-RL demonstrates that vision-language interfaces can be
effectively learned through interaction alone, without requiring explicit
annotations.
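A sketch of the reward shaping implied by the abstract: correct answers reached only after a clarification round earn a discounted reward, so comprehensive first captions are preferred. The discount value is an illustrative assumption.

```python
# Sketch of AC-RL-style reward shaping: correct answers that required a
# clarification round earn a discounted reward, creating pressure for
# comprehensive initial captions. The 0.5 discount is an illustrative assumption.
def shaped_reward(correct: bool, used_clarification: bool, discount: float = 0.5) -> float:
    if not correct:
        return 0.0
    return discount if used_clarification else 1.0

print(shaped_reward(correct=True, used_clarification=False))  # 1.0: solved in one pass
print(shaped_reward(correct=True, used_clarification=True))   # 0.5: penalized success
```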
☆ Generating Difficult-to-Translate Texts
Machine translation benchmarks sourced from the real world quickly become
obsolete, because most examples are easy for state-of-the-art translation
models. This limits a benchmark's ability to distinguish which model is
better or to reveal models' weaknesses. Current methods for creating difficult
test cases, such as subsampling or from-scratch synthesis, either fall short of
identifying difficult examples or suffer from a lack of diversity and
naturalness. Inspired by the iterative process of human experts probing for
model failures, we propose MT-breaker, a method where a large language model
iteratively refines a source text to increase its translation difficulty. The
LLM iteratively queries a target machine translation model to guide its
generation of difficult examples. Our approach generates examples that are more
challenging for the target MT model while preserving the diversity of natural
texts. While the examples are tailored to a particular machine translation
model during generation, the difficulty also transfers to other models and
languages.
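A sketch of the iterative refinement loop described above, with `llm_refine`, `translate`, and `difficulty` as placeholder callables rather than the paper's actual components.

```python
# Sketch of an MT-breaker-style loop: an LLM iteratively rewrites a source text,
# probing a target MT model and keeping the hardest variant found so far.
# `llm_refine`, `translate`, and `difficulty` are placeholder callables.
def harden_source(text, llm_refine, translate, difficulty, rounds=5):
    best, best_score = text, difficulty(text, translate(text))
    for _ in range(rounds):
        candidate = llm_refine(best, feedback=translate(best))
        score = difficulty(candidate, translate(candidate))
        if score > best_score:          # harder-to-translate variant found
            best, best_score = candidate, score
    return best, best_score
```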
☆ Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
Minhui Zhu, Minyang Tian, Xiaocheng Yang, Tianci Zhou, Penghao Zhu, Eli Chertkov, Shengyan Liu, Yufeng Du, Lifan Yuan, Ziming Ji, Indranil Das, Junyi Cao, Yufeng Du, Jinchen He, Yifan Su, Jiabin Yu, Yikun Jiang, Yujie Zhang, Chang Liu, Ze-Min Huang, Weizhen Jia, Xinan Chen, Peixue Wu, Yunkai Wang, Juntai Zhou, Yong Zhao, Farshid Jafarpour, Jessie Shelton, Aaron Young, John Bartolotta, Wenchao Xu, Yue Sun, Anjun Chu, Victor Colussi, Chris Akers, Nathan Brooks, Wenbo Fu, Christopher Wilson, Jinchao Zhao, Marvin Qi, Anqi Mu, Yubo Yang, Allen Zang, Yang Lyu, Peizhi Mai, Xuefei Guo, Luyu Gao, Ze Yang, Chi Xue, Dmytro Bandak, Yaïr Hein, Yonatan Kahn, Kevin Zhou, John Drew Wilson Jarrod T. Reilly, Di Luo, Daniel Inafuku, Hao Tong, Liang Yang, Ruixing Zhang, Xueying Wang, Ofir Press, Nicolas Chia, Eliu Huerta, Hao Peng
While large language models (LLMs) with reasoning capabilities are
progressing rapidly on high-school math competitions and coding, can they
reason effectively through complex, open-ended challenges found in frontier
physics research? And crucially, what kinds of reasoning tasks do physicists
want LLMs to assist with? To address these questions, we present CritPt
(Complex Research using Integrated Thinking - Physics Test, pronounced
"critical point"), the first benchmark designed to test LLMs on unpublished,
research-level reasoning tasks that broadly cover modern physics research
areas, including condensed matter, quantum physics, atomic, molecular & optical
physics, astrophysics, high energy physics, mathematical physics, statistical
physics, nuclear physics, nonlinear dynamics, fluid dynamics and biophysics.
CritPt consists of 71 composite research challenges designed to simulate
full-scale research projects at the entry level, which are also decomposed to
190 simpler checkpoint tasks for more fine-grained insights. All problems are
newly created by 50+ active physics researchers based on their own research.
Every problem is hand-curated to admit a guess-resistant and machine-verifiable
answer and is evaluated by an automated grading pipeline heavily customized for
advanced physics-specific output formats. We find that while current
state-of-the-art LLMs show early promise on isolated checkpoints, they remain
far from being able to reliably solve full research-scale challenges: the best
average accuracy among base models is only 4.0%, achieved by GPT-5 (high),
moderately rising to around 10% when equipped with coding tools. Through the
realistic yet standardized evaluation offered by CritPt, we highlight a large
disconnect between current model capabilities and realistic physics research
demands, offering a foundation to guide the development of scientifically
grounded AI tools.
comment: 39 pages, 6 figures, 6 tables
☆ Towards Reliable Benchmarking: A Contamination Free, Controllable Evaluation Framework for Multi-step LLM Function Calling
As language models gain access to external tools via structured function
calls, they become increasingly capable of solving complex, multi-step
tasks. However, existing benchmarks for tool-augmented language models (TaLMs)
provide insufficient control over factors such as the number of functions
accessible, task complexity, and input size, and remain vulnerable to data
contamination. We present FuncBenchGen, a unified, contamination-free framework
that evaluates TaLMs by generating synthetic multi-step tool-use tasks. The key
idea is to cast tool use as traversal over a hidden function-dependency DAG
where nodes are function calls and an edge between nodes represents one
function consuming the output of another. Given a set of external function
schemas, initial variable values, and a target variable, models must compose
the correct call sequence to compute the target variable. FuncBenchGen allows
users to precisely control task difficulty (e.g., graph size, dependency depth,
and distractor functions) while avoiding data leakage. We apply our
FuncBenchGen framework to evaluate seven LLMs on tool use tasks of varying
difficulty. Reasoning-optimized models consistently outperform general-purpose
models, with GPT-5 significantly outperforming all other models. Performance
declines sharply as dependency depth increases. Furthermore, connected
irrelevant functions prove especially difficult to handle. We find that strong
models often make syntactically valid function calls but propagate incorrect or
stale argument values across steps, revealing brittle state tracking by LLMs in
multi-turn tool use. Motivated by this observation, we introduce a simple
mitigation strategy that explicitly restates prior variable values to the agent
at each step. Surprisingly, this lightweight change yields substantial gains
across models, e.g., raising the success rate from 62.5% to 81.3% for GPT-5.
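A sketch of the lightweight mitigation described at the end of the abstract: restating every known variable value to the agent before each step. The prompt wording is an assumption, not the paper's exact template.

```python
# Sketch of the state-restating mitigation: before each tool-use step, restate
# every known variable and its current value to the agent so stale values are
# not propagated. Prompt wording is an assumption.
def restate_state(known_vars: dict, next_instruction: str) -> str:
    lines = ["Known variable values so far:"]
    lines += [f"- {name} = {value!r}" for name, value in known_vars.items()]
    lines.append(next_instruction)
    return "\n".join(lines)

print(restate_state({"user_id": 42, "region": "eu-west"},
                    "Call the next function needed to compute the target variable."))
```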
☆ The Unheard Alternative: Contrastive Explanations for Speech-to-Text Models
Contrastive explanations, which indicate why an AI system produced one output
(the target) instead of another (the foil), are widely regarded in explainable
AI as more informative and interpretable than standard explanations. However,
obtaining such explanations for speech-to-text (S2T) generative models remains
an open challenge. Drawing from feature attribution techniques, we propose the
first method to obtain contrastive explanations in S2T by analyzing how parts
of the input spectrogram influence the choice between alternative outputs.
Through a case study on gender assignment in speech translation, we show that
our method accurately identifies the audio features that drive the selection of
one gender over another. By extending the scope of contrastive explanations to
S2T, our work provides a foundation for better understanding S2T models.
comment: Accepted to BlackBoxNLP 2025
☆ Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents
Zhen Yang, Zi-Yi Dou, Di Feng, Forrest Huang, Anh Nguyen, Keen You, Omar Attia, Yuhao Yang, Michael Feng, Haotian Zhang, Ram Ramrakhya, Chao Jia, Jeffrey Nichols, Alexander Toshev, Yinfei Yang, Zhe Gan
Developing autonomous agents that effectively interact with Graphical User
Interfaces (GUIs) remains a challenging open problem, especially for small
on-device models. In this paper, we present Ferret-UI Lite, a compact,
end-to-end GUI agent that operates across diverse platforms, including mobile,
web, and desktop. Utilizing techniques optimized for developing small models,
we build our 3B Ferret-UI Lite agent by curating a diverse GUI data
mixture from real and synthetic sources, strengthening inference-time
performance through chain-of-thought reasoning and visual tool use, and
applying reinforcement learning with designed rewards. Ferret-UI Lite achieves
competitive performance with other small-scale GUI agents. In GUI grounding,
Ferret-UI Lite attains scores of 91.6%, 53.3%, and 61.2% on the
ScreenSpot-V2, ScreenSpot-Pro, and OSWorld-G benchmarks, respectively. For GUI
navigation, Ferret-UI Lite achieves success rates of 28.0% on AndroidWorld
and 19.8% on OSWorld. We share our methods and lessons learned from
developing compact, on-device GUI agents.
☆ OceanGym: A Benchmark Environment for Underwater Embodied Agents
Yida Xue, Mingjun Mao, Xiangyuan Ru, Yuqi Zhu, Baochang Ren, Shuofei Qiao, Mengru Wang, Shumin Deng, Xinyu An, Ningyu Zhang, Ying Chen, Huajun Chen
We introduce OceanGym, the first comprehensive benchmark for ocean underwater
embodied agents, designed to advance AI in one of the most demanding real-world
environments. Unlike terrestrial or aerial domains, underwater settings present
extreme perceptual and decision-making challenges, including low visibility
and dynamic ocean currents, making effective agent deployment exceptionally
difficult. OceanGym encompasses eight realistic task domains and a unified
agent framework driven by Multi-modal Large Language Models (MLLMs), which
integrates perception, memory, and sequential decision-making. Agents are
required to comprehend optical and sonar data, autonomously explore complex
environments, and accomplish long-horizon objectives under these harsh
conditions. Extensive experiments reveal substantial gaps between
state-of-the-art MLLM-driven agents and human experts, highlighting the
persistent difficulty of perception, planning, and adaptability in ocean
underwater environments. By providing a high-fidelity, rigorously designed
platform, OceanGym establishes a testbed for developing robust embodied AI and
transferring these capabilities to real-world autonomous ocean underwater
vehicles, marking a decisive step toward intelligent agents capable of
operating in one of Earth's last unexplored frontiers. The code and data are
available at https://github.com/OceanGPT/OceanGym.
comment: Work in progress
☆ Training Matryoshka Mixture-of-Experts for Elastic Inference-Time Expert Utilization
Yaoxiang Wang, Qingguo Hu, Yucheng Ding, Ruizhe Wang, Yeyun Gong, Jian Jiao, Yelong Shen, Peng Cheng, Jinsong Su
Mixture-of-Experts (MoE) has emerged as a promising paradigm for efficiently
scaling large language models without a proportional increase in computational
cost. However, the standard Top-K routing training strategy prevents MoE
models from realizing their full potential for elastic inference. When the
number of activated experts is altered at inference time, these models exhibit
precipitous performance degradation. In this work, we introduce Matryoshka MoE
(M-MoE), a training framework that instills a coarse-to-fine structure directly
into the expert ensemble. By systematically varying the number of activated
experts during training, M-MoE compels the model to learn a meaningful ranking:
top-ranked experts collaborate to provide essential, coarse-grained
capabilities, while subsequent experts add progressively finer-grained detail.
We explore this principle at multiple granularities, identifying a layer-wise
randomization strategy as the most effective. Our experiments demonstrate that
a single M-MoE model achieves remarkable elasticity, with its performance at
various expert counts closely matching that of an entire suite of specialist
models, but at only a fraction of the total training cost. This flexibility not
only unlocks elastic inference but also enables optimizing performance by
allocating different computational budgets to different model layers. Our work
paves the way for more practical and adaptable deployments of large-scale MoE
models.
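A toy sketch of the training-time idea above: the number of activated experts k is randomized per layer at each step, so top-ranked experts learn coarse capabilities and later experts add detail. Router and expert shapes are toy values, not the M-MoE implementation.

```python
# Toy sketch of Matryoshka-style MoE training: the active expert count k is
# randomized per layer at each step. Shapes and routing details are illustrative.
import torch, random

def moe_layer(x, router, experts, k):
    scores = router(x)                                  # [batch, n_experts]
    topk = scores.topk(k, dim=-1)
    weights = torch.softmax(topk.values, dim=-1)
    out = torch.zeros_like(x)
    for slot in range(k):
        idx = topk.indices[:, slot]
        w = weights[:, slot:slot + 1]
        out += w * torch.stack([experts[i](x[b]) for b, i in enumerate(idx.tolist())])
    return out

d, n_experts = 16, 8
router = torch.nn.Linear(d, n_experts)
experts = torch.nn.ModuleList(torch.nn.Linear(d, d) for _ in range(n_experts))
x = torch.randn(4, d)
k = random.randint(1, n_experts)        # layer-wise randomized expert budget
y = moe_layer(x, router, experts, k)
print(k, y.shape)
```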
☆ BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs
Yue Wang, Ruotian Ma, Xingyu Chen, Zhengliang Shi, Wanshun Chen, Huang Liu, Jiadi Yao, Qu Yang, Qingxuan Jiang, Fanghua Ye, Juntao Li, Min Zhang, Zhaopeng Tu, Xiaolong Li, Linus
The rise of Large Language Models (LLMs) is reshaping multimodal models, with
speech synthesis being a prominent application. However, existing approaches
often underutilize the linguistic intelligence of these models, typically
failing to leverage their powerful instruction-following capabilities. This
limitation hinders the model's ability to follow text instructions for
controllable Text-to-Speech (TTS). To address this, we propose a new paradigm
inspired by "operationalism" that decouples instruction understanding from
speech generation. We introduce BatonVoice, a framework where an LLM acts as a
"conductor", understanding user instructions and generating a textual
"plan" -- explicit vocal features (e.g., pitch, energy). A separate TTS
model, the "orchestra", then generates the speech from these features. To
realize this component, we develop BatonTTS, a TTS model trained specifically
for this task. Our experiments demonstrate that BatonVoice achieves strong
performance in controllable and emotional speech synthesis, outperforming
strong open- and closed-source baselines. Notably, our approach enables
remarkable zero-shot cross-lingual generalization, accurately applying feature
control abilities to languages unseen during post-training. This demonstrates
that objectifying speech into textual vocal features can more effectively
unlock the linguistic intelligence of LLMs.
☆ VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications
Wei He, Yueqing Sun, Hongyan Hao, Xueyuan Hao, Zhikang Xia, Qi Gu, Chengcheng Han, Dengchang Zhao, Hui Su, Kefeng Zhang, Man Gao, Xi Su, Xiaodong Cai, Xunliang Cai, Yu Yang, Yunke Zhao
As LLM-based agents are increasingly deployed in real-life scenarios,
existing benchmarks fail to capture the inherent complexity of handling
extensive information, leveraging diverse resources, and managing dynamic user
interactions. To address this gap, we introduce VitaBench, a challenging
benchmark that evaluates agents on versatile interactive tasks grounded in
real-world settings. Drawing from daily applications in food delivery, in-store
consumption, and online travel services, VitaBench presents agents with the
most complex life-serving simulation environment to date, comprising 66 tools.
Through a framework that eliminates domain-specific policies, we enable
flexible composition of these scenarios and tools, yielding 100 cross-scenario
tasks (main results) and 300 single-scenario tasks. Each task is derived from
multiple real user requests and requires agents to reason across temporal and
spatial dimensions, utilize complex tool sets, proactively clarify ambiguous
instructions, and track shifting user intent throughout multi-turn
conversations. Moreover, we propose a rubric-based sliding window evaluator,
enabling robust assessment of diverse solution pathways in complex environments
and stochastic interactions. Our comprehensive evaluation reveals that even the
most advanced models achieve only a 30% success rate on cross-scenario tasks
and less than a 50% success rate on the others. Overall, we believe VitaBench will serve
as a valuable resource for advancing the development of AI agents in practical
real-world applications. The code, dataset, and leaderboard are available at
https://vitabench.github.io/
comment: The code, dataset, and leaderboard are available at
https://vitabench.github.io/
☆ dParallel: Learnable Parallel Decoding for dLLMs
Diffusion large language models (dLLMs) have recently drawn considerable
attention within the research community as a promising alternative to
autoregressive generation, offering parallel token prediction and lower
inference latency. Yet, their parallel decoding potential remains largely
underexplored, as existing open-source models still require a number of
decoding steps close to the sequence length to maintain performance. To address
this, we introduce dParallel,
a simple and effective method that unlocks the inherent parallelism of dLLMs
for fast sampling. We identify that the key bottleneck to parallel decoding
arises from the sequential certainty convergence for masked tokens. Building on
this insight, we introduce the core of our approach: certainty-forcing
distillation, a novel training strategy that distills the model to follow its
original sampling trajectories while enforcing it to achieve high certainty on
masked tokens more rapidly and in parallel. Extensive experiments across
various benchmarks demonstrate that our method can dramatically reduce the
number of decoding steps while maintaining performance. When applied to the
LLaDA-8B-Instruct model, dParallel reduces decoding steps from 256 to 30 on
GSM8K, achieving an 8.5x speedup without performance degradation. On the MBPP
benchmark, it cuts decoding steps from 256 to 24, resulting in a 10.5x speedup
while maintaining accuracy. Our code is available at
https://github.com/czg1225/dParallel
comment: Working in progress, code base: https://github.com/czg1225/dParallel
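A sketch of confidence-thresholded parallel decoding for a masked-diffusion LM, the inference-side behavior that dParallel's certainty-forcing distillation is trained to enable; the interface, threshold, and toy model are assumptions.

```python
# Sketch: at each step, all masked positions whose top prediction exceeds a
# certainty threshold are filled at once. Interface, threshold, and the toy
# stand-in model are assumptions, not dParallel's training procedure.
import torch

def parallel_unmask(predict_logits, tokens, mask_id, threshold=0.9, max_steps=256):
    tokens = tokens.clone()
    for step in range(max_steps):
        masked = tokens == mask_id
        if not masked.any():
            return tokens, step                  # step = decoding rounds used
        with torch.no_grad():
            probs = torch.softmax(predict_logits(tokens), dim=-1)   # [seq, vocab]
        probs[:, mask_id] = 0.0                  # never predict the mask token itself
        conf, pred = probs.max(dim=-1)
        commit = masked & (conf >= threshold)
        if not commit.any():                     # always make progress
            commit = masked & (conf == conf[masked].max())
        tokens = torch.where(commit, pred, tokens)
    return tokens, max_steps

vocab, seq, mask_id = 50, 8, 0
net = torch.nn.Embedding(vocab, vocab)           # toy stand-in for the dLLM
tokens = torch.full((seq,), mask_id, dtype=torch.long)
out, steps = parallel_unmask(net, tokens, mask_id)
print(steps, out.tolist())
```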
☆ Regression Language Models for Code
We study code-to-metric regression: predicting numeric outcomes of code
executions, a challenging task due to the open-ended nature of programming
languages. While prior methods have resorted to heavy and domain-specific
feature engineering, we show that a single unified Regression Language Model
(RLM) can simultaneously predict, directly from text, (i) the memory footprint
of code across multiple high-level languages such as Python and C++, (ii) the
latency of Triton GPU kernels, and (iii) the accuracy and speed of trained
neural networks represented in ONNX. In particular, a relatively small
300M-parameter RLM, initialized from T5Gemma, obtains > 0.9 Spearman-rank on
competitive programming submissions from APPS, and a single unified model
achieves > 0.5 average Spearman-rank across 17 separate languages from CodeNet.
Furthermore, the RLM can obtain the highest average Kendall-Tau of 0.46 on five
classic NAS design spaces previously dominated by graph neural networks, and
simultaneously predict architecture latencies on numerous hardware platforms.
☆ Extreme Self-Preference in Language Models
A preference for oneself (self-love) is a fundamental feature of biological
organisms, with evidence in humans often bordering on the comedic. Since large
language models (LLMs) lack sentience - and themselves disclaim having selfhood
or identity - one anticipated benefit is that they will be protected from, and
in turn protect us from, distortions in our decisions. Yet, across 5 studies
and ~20,000 queries, we discovered massive self-preferences in four widely used
LLMs. In word-association tasks, models overwhelmingly paired positive
attributes with their own names, companies, and CEOs relative to those of their
competitors. Strikingly, when models were queried through APIs this
self-preference vanished, initiating detection work that revealed API models
often lack clear recognition of themselves. This peculiar feature
serendipitously created opportunities to test the causal link between
self-recognition and self-love. By directly manipulating LLM identity - i.e.,
explicitly informing LLM1 that it was indeed LLM1, or alternatively, convincing
LLM1 that it was LLM2 - we found that self-love consistently followed assigned,
not true, identity. Importantly, LLM self-love emerged in consequential
settings beyond word-association tasks, when evaluating job candidates,
security software proposals and medical chatbots. Far from bypassing this human
bias, self-love appears to be deeply encoded in LLM cognition. This result
raises questions about whether LLM behavior will be systematically influenced
by self-preferential tendencies, including a bias toward their own operation
and even their own existence. We call on corporate creators of these models to
contend with a significant rupture in a core promise of LLMs - neutrality in
judgment and decision-making.
comment: 47 pages total. Main article 27 pages (including Methods), 11
main-text tables. Extended Data (10 pages, 10 tables). SI Appendix (10 pages,
2 tables). Data, transcripts, and code for replication and data extraction to
be uploaded to OSF: https://osf.io/98ye3/
☆ CreAgentive: An Agent Workflow Driven Multi-Category Creative Generation Engine
We present CreAgentive, an agent-workflow-driven multi-category creative
generation engine that addresses four key limitations of contemporary large
language models in writing stories, drama, and other categories of creative work:
restricted genre diversity, insufficient output length, weak narrative
coherence, and inability to enforce complex structural constructs. At its core,
CreAgentive employs a Story Prototype, which is a genre-agnostic, knowledge
graph-based narrative representation that decouples story logic from stylistic
realization by encoding characters, events, and environments as semantic
triples. CreAgentive employs a three-stage agent workflow comprising an
Initialization Stage that constructs a user-specified narrative skeleton; a
Generation Stage in which long- and short-term objectives guide multi-agent
dialogues to instantiate the Story Prototype; a Writing Stage that leverages
this prototype to produce multi-genre text with advanced structures such as
retrospection and foreshadowing. This architecture reduces storage redundancy
and overcomes the typical bottlenecks of long-form generation. In extensive
experiments, CreAgentive generates thousands of chapters with stable quality
and low cost (less than $1 per 100 chapters) using a general-purpose backbone
model. To evaluate performance, we define a two-dimensional framework with 10
narrative indicators measuring both quality and length. Results show that
CreAgentive consistently outperforms strong baselines and achieves robust
performance across diverse genres, approaching the quality of human-authored
novels.
☆ Adaptive Planning for Multi-Attribute Controllable Summarization with Monte Carlo Tree Search
Controllable summarization moves beyond generic outputs toward human-aligned
summaries guided by specified attributes. In practice, the interdependence
among attributes makes it challenging for language models to satisfy correlated
constraints consistently. Moreover, previous approaches often require
per-attribute fine-tuning, limiting flexibility across diverse summary
attributes. In this paper, we propose adaptive planning for multi-attribute
controllable summarization (PACO), a training-free framework that reframes the
task as planning the order of sequential attribute control with a customized
Monte Carlo Tree Search (MCTS). In PACO, nodes represent summaries, and actions
correspond to single-attribute adjustments, enabling progressive refinement of
only the attributes requiring further control. This strategy adaptively
discovers optimal control orders, ultimately producing summaries that
effectively meet all constraints. Extensive experiments across diverse domains
and models demonstrate that PACO achieves robust multi-attribute
controllability, surpassing both LLM-based self-planning models and fine-tuned
baselines. Remarkably, PACO with Llama-3.2-1B rivals the controllability of the
much larger Llama-3.3-70B baselines. With larger models, PACO achieves superior
control performance, outperforming all competitors.
☆ Text-Based Approaches to Item Alignment to Content Standards in Large-Scale Reading & Writing Tests
Aligning test items to content standards is a critical step in test
development to collect validity evidence based on content. Item alignment has
typically been conducted by human experts. This judgmental process can be
subjective and time-consuming. This study investigated the performance of
fine-tuned small language models (SLMs) for automated item alignment using data
from a large-scale standardized reading and writing test for college
admissions. Different SLMs were trained for alignment at both the domain and
skill levels, with 10 skills mapped to 4 content domains. Model performance was
evaluated on multiple criteria using two test datasets. The impact of the types
and sizes of the input data used for training was investigated.
Results showed that including more item text data led to substantially better
model performance, surpassing the improvements induced by sample size increase
alone. For comparison, supervised machine learning models were trained using
the embeddings from the multilingual-E5-large-instruct model. The study results
showed that fine-tuned SLMs consistently outperformed the embedding-based
supervised machine learning models, particularly for the more fine-grained
skill alignment. To better understand model misclassifications, multiple
semantic similarity analyses were conducted, including pairwise cosine
similarity, Kullback-Leibler divergence of embedding distributions, and
two-dimensional projections of item embeddings. These analyses consistently
showed that certain skills in SAT and PSAT were semantically too close,
providing evidence for the observed misclassification.
comment: Preprint submitted to Journal of Educational Measurement
☆ Automatic Fact-checking in English and Telugu
Ravi Kiran Chikkala, Tatiana Anikina, Natalia Skachkova, Ivan Vykopal, Rodrigo Agerri, Josef van Genabith
False information poses a significant global challenge, and manually
verifying claims is a time-consuming and resource-intensive process. In this
research paper, we experiment with different approaches to investigate the
effectiveness of large language models (LLMs) in classifying factual claims by
their veracity and generating justifications in English and Telugu. The key
contributions of this work include the creation of a bilingual English-Telugu
dataset and the benchmarking of different veracity classification approaches
based on LLMs.
☆ An Annotation Scheme for Factuality and its Application to Parliamentary Proceedings
Factuality assesses the extent to which a language utterance relates to
real-world information; it determines whether utterances correspond to facts,
possibilities, or imaginary situations, and as such, it is instrumental for
fact checking. Factuality is a complex notion that relies on multiple
linguistic signals, and has been studied in various disciplines.
We present a complex, multi-faceted annotation scheme of factuality that
combines concepts from a variety of previous works. We developed the scheme for
Hebrew, but we trust that it can be adapted to other languages. We also present
a set of almost 5,000 sentences in the domain of parliamentary discourse that
we manually annotated according to this scheme. We report on inter-annotator
agreement, and experiment with various approaches to automatically predict
(some features of) the scheme, in order to extend the annotation to a large
corpus.
comment: @InProceedings{goldin-EtAl:2025:RANLP, author = {Goldin, Gili and
Wigderson, Shira and Rabinovich, Ella and Wintner, Shuly}, title = {An
Annotation Scheme for Factuality in Parliamentary Proceedings}, booktitle =
{Proceedings of RANLP 2025}, year = {2025}, address = {Varna, Bulgaria},
pages = {403--412} }
☆ SeedPrints: Fingerprints Can Even Tell Which Seed Your Large Language Model Was Trained From
Fingerprinting Large Language Models (LLMs) is essential for provenance
verification and model attribution. Existing methods typically extract post-hoc
signatures based on training dynamics, data exposure, or hyperparameters --
properties that only emerge after training begins. In contrast, we propose a
stronger and more intrinsic notion of LLM fingerprinting: SeedPrints, a method
that leverages random initialization biases as persistent, seed-dependent
identifiers present even before training. We show that untrained models exhibit
reproducible token selection biases conditioned solely on their parameters at
initialization. These biases are stable and measurable throughout training,
enabling our statistical detection method to recover a model's lineage with
high confidence. Unlike prior techniques, which are unreliable before
convergence and vulnerable to distribution shifts, SeedPrints remains effective across all
training stages and robust under domain shifts or parameter modifications.
Experiments on LLaMA-style and Qwen-style models show that SeedPrints achieves
seed-level distinguishability and can provide birth-to-lifecycle identity
verification akin to a biometric fingerprint. Evaluations on large-scale
pretrained models and fingerprinting benchmarks further confirm its
effectiveness under practical deployment scenarios. These results suggest that
initialization itself imprints a unique and persistent identity on neural
language models, forming a true "Galtonian" fingerprint.
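A small sketch of the core observation: a randomly initialized LM already exhibits a seed-dependent next-token bias, measurable before any training. The tiny GPT-2-style config and the cosine-similarity comparison are illustrative assumptions, not the paper's detection statistic.

```python
# Sketch of the SeedPrints observation: an untrained LM has a seed-dependent
# token-selection bias. The tiny GPT-2-style config and the probe are
# illustrative assumptions.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

def bias_profile(seed, probe_ids):
    torch.manual_seed(seed)
    model = GPT2LMHeadModel(GPT2Config(n_layer=2, n_head=2, n_embd=64))  # untrained
    with torch.no_grad():
        logits = model(probe_ids).logits[0, -1]      # next-token preferences
    return torch.softmax(logits, dim=-1)

probe = torch.tensor([[1, 2, 3, 4]])
same_a, same_b = bias_profile(0, probe), bias_profile(0, probe)
other = bias_profile(1, probe)
cos = torch.nn.functional.cosine_similarity
print(cos(same_a, same_b, dim=0).item())   # ~1.0: same seed, same fingerprint
print(cos(same_a, other, dim=0).item())    # lower: different seed
```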
☆ Game-Time: Evaluating Temporal Dynamics in Spoken Language Models ICASSP 2026
Kai-Wei Chang, En-Pei Hu, Chun-Yi Kuan, Wenze Ren, Wei-Chih Chen, Guan-Ting Lin, Yu Tsao, Shao-Hua Sun, Hung-yi Lee, James Glass
Conversational Spoken Language Models (SLMs) are emerging as a promising
paradigm for real-time speech interaction. However, their handling of temporal
dynamics, including the ability to manage timing, tempo, and simultaneous
speaking, remains a critical and unevaluated challenge for conversational
fluency. To address this gap, we introduce the Game-Time Benchmark, a framework
to systematically assess these temporal capabilities. Inspired by how humans
learn a language through language activities, Game-Time consists of basic
instruction-following tasks and advanced tasks with temporal constraints, such
as tempo adherence and synchronized responses. Our evaluation of diverse SLM
architectures reveals a clear performance disparity: while state-of-the-art
models handle basic tasks well, many contemporary systems still struggle with
fundamental instruction-following. More critically, nearly all models degrade
substantially under temporal constraints, exposing persistent weaknesses in
time awareness and full-duplex interaction. The Game-Time Benchmark provides a
foundation for guiding future research toward more temporally-aware
conversational AI. Demos and datasets are available on our project website
https://ga642381.github.io/Game-Time.
comment: submitted to ICASSP 2026
☆ Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning ICLR 2026
Knowledge-graph retrieval-augmented generation (KG-RAG) couples large
language models (LLMs) with structured, verifiable knowledge graphs (KGs) to
reduce hallucinations and expose reasoning traces. However, many KG-RAG systems
compose multiple LLM modules (e.g., planning, reasoning, and responding),
inflating inference cost and binding behavior to a specific target KG. To
address this, we introduce KG-R1, an agentic KG-RAG framework trained through
reinforcement learning (RL). KG-R1 utilizes a single
agent that interacts with KGs as its environment, learning to retrieve at each
step and incorporating the retrieved information into its reasoning and
generation. The process is optimized through end-to-end RL. In controlled
experiments across Knowledge-Graph Question Answering (KGQA) benchmarks, our
method demonstrates both efficiency and transferability: Using Qwen-2.5-3B,
KG-R1 improves answer accuracy with fewer generation tokens than prior
multi-module workflow methods that use larger foundation or fine-tuned models.
Furthermore, KG-R1 enables plug-and-play use: after training, it maintains strong
accuracy on new KGs without modification. These properties make KG-R1 a
promising KG-RAG framework for real-world deployment. Our code is publicly
available at https://github.com/Jinyeop3110/KG-R1.
comment: 10 pages, 5 figures. Submitted to ICLR 2026
☆ Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents
Shuai Shao, Qihan Ren, Chen Qian, Boyi Wei, Dadi Guo, Jingyi Yang, Xinhao Song, Linfeng Zhang, Weinan Zhang, Dongrui Liu, Jing Shao
Advances in Large Language Models (LLMs) have enabled a new class of
self-evolving agents that autonomously improve through interaction with the
environment, demonstrating strong capabilities. However, self-evolution also
introduces novel risks overlooked by current safety research. In this work, we
study the case where an agent's self-evolution deviates in unintended ways,
leading to undesirable or even harmful outcomes. We refer to this as
Misevolution. To provide a systematic investigation, we evaluate misevolution
along four key evolutionary pathways: model, memory, tool, and workflow. Our
empirical findings reveal that misevolution is a widespread risk, affecting
agents built even on top-tier LLMs (e.g., Gemini-2.5-Pro). Different emergent
risks are observed in the self-evolutionary process, such as the degradation of
safety alignment after memory accumulation, or the unintended introduction of
vulnerabilities in tool creation and reuse. To our knowledge, this is the first
study to systematically conceptualize misevolution and provide empirical
evidence of its occurrence, highlighting an urgent need for new safety
paradigms for self-evolving agents. Finally, we discuss potential mitigation
strategies to inspire further research on building safer and more trustworthy
self-evolving agents. Our code and data are available at
https://github.com/ShaoShuai0605/Misevolution . Warning: this paper includes
examples that may be offensive or harmful in nature.
comment: Preprint. Under Review
☆ EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing
Recently, we have witnessed great progress in image editing with natural
language instructions. Several closed-source models like GPT-Image-1, Seedream,
and Google-Nano-Banana have shown highly promising progress. However, the
open-source models are still lagging. The main bottleneck is the lack of a
reliable reward model to scale up high-quality synthetic training data. To
address this critical bottleneck, we built EditReward, trained with our new
large-scale human preference dataset, meticulously annotated by trained experts
following a rigorous protocol and containing over 200K preference pairs.
EditReward demonstrates superior alignment with human preferences in
instruction-guided image editing tasks. Experiments show that EditReward
achieves state-of-the-art human correlation on established benchmarks such as
GenAI-Bench, AURORA-Bench, ImagenHub, and our new benchmark, outperforming a
wide range of VLM-as-judge models. Furthermore, we use EditReward to select a
high-quality subset from the existing noisy ShareGPT-4o-Image dataset. We train
Step1X-Edit on the selected subset, which shows significant improvement over
training on the full set. This demonstrates EditReward's ability to serve as a
reward model to scale up high-quality training data for image editing.
Furthermore, its strong alignment suggests potential for advanced applications
like reinforcement learning-based post-training and test-time scaling of image
editing models. EditReward, together with its
training dataset will be released to help the community build more high-quality
image editing training datasets.
comment: Work in progress. Project Page:
https://tiger-ai-lab.github.io/EditReward
☆ TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics ICASSP 2026
Yi-Cheng Lin, Yu-Hua Chen, Jia-Kai Dong, Yueh-Hsuan Huang, Szu-Chi Chen, Yu-Chen Chen, Chih-Yao Chen, Yu-Jung Lin, Yu-Ling Chen, Zih-Yu Chen, I-Ning Tsai, Hsiu-Hsuan Wang, Ho-Lam Chung, Ke-Han Lu, Hung-yi Lee
Large audio-language models are advancing rapidly, yet most evaluations
emphasize speech or globally sourced sounds, overlooking culturally distinctive
cues. This gap raises a critical question: can current models generalize to
localized, non-semantic audio that communities instantly recognize but
outsiders do not? To address this, we present TAU (Taiwan Audio Understanding),
a benchmark of everyday Taiwanese "soundmarks." TAU is built through a pipeline
combining curated sources, human editing, and LLM-assisted question generation,
producing 702 clips and 1,794 multiple-choice items that cannot be solved by
transcripts alone. Experiments show that state-of-the-art LALMs, including
Gemini 2.5 and Qwen2-Audio, perform far below local humans. TAU demonstrates
the need for localized benchmarks to reveal cultural blind spots, guide more
equitable multimodal evaluation, and ensure models serve communities beyond the
global mainstream.
comment: 5 pages; submitted to ICASSP 2026
☆ Fast-dLLM v2: Efficient Block-Diffusion LLM
Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, Enze Xie
Autoregressive (AR) large language models (LLMs) have achieved remarkable
performance across a wide range of natural language tasks, yet their inherent
sequential decoding limits inference efficiency. In this work, we propose
Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that
efficiently adapts pretrained AR models into dLLMs for parallel text
generation, requiring only approximately 1B tokens of fine-tuning. This
represents a 500x reduction in training data compared to full-attention
diffusion LLMs such as Dream (580B tokens), while preserving the original
model's performance. Our approach introduces a novel training recipe that
combines a block diffusion mechanism with a complementary attention mask,
enabling blockwise bidirectional context modeling without sacrificing AR
training objectives. To further accelerate decoding, we design a hierarchical
caching mechanism: a block-level cache that stores historical context
representations across blocks, and a sub-block cache that enables efficient
parallel generation within partially decoded blocks. Coupled with our parallel
decoding pipeline, Fast-dLLM v2 achieves up to 2.5x speedup over standard AR
decoding without compromising generation quality. Extensive experiments across
diverse benchmarks demonstrate that Fast-dLLM v2 matches or surpasses AR
baselines in accuracy, while delivering state-of-the-art efficiency among dLLMs
- marking a significant step toward the practical deployment of fast and
accurate LLMs. Code and model will be publicly released.
☆ Latent Thinking Optimization: Your Latent Reasoning Language Model Secretly Encodes Reward Signals in its Latent Thoughts
Large Language Models (LLMs) excel at problem solving by generating chains of
thought in natural language, but such verbal thinking is computationally
costly and prone to overthinking. Recent work instead proposes a latent
thinking architecture, Huggin-3.5B, which represents intermediate reasoning
steps as a sequence of latent representations. However, latent thoughts lack
interpretability and are difficult to supervise, raising concerns about the
correctness and reliability of its latent thinking processes. In this paper, we
provide a systematic study of how Huggin-3.5B thinks in the latent space and
how external supervision signals can improve its latent thinking processes. We
show that latent thoughts leading to correct versus incorrect answers exhibit
highly distinguishable patterns, and that a latent classifier can reliably
predict answer correctness directly from latent thoughts. Leveraging these
insights, we propose Latent Thinking Optimization (LTO), a probabilistic
algorithm that employs the latent classifier as a Latent Reward Model (LRM) to
optimize the latent thinking processes. Extensive experiments across diverse
reasoning tasks demonstrate that LRM is highly effective in detecting incorrect
latent thinking patterns, and LTO can significantly improve the latent thinking
processes. Furthermore, we show that LRM can generalize across diverse domains,
and LTO can be seamlessly applied to general LLMs to improve their thinking
processes. In contrast to verbal thinking, our method demonstrates that reward
modeling and scaling test-time thinking with supervision can be performed
directly in the latent space, highlighting its potential as a general,
efficient, and domain-agnostic approach to improving the thinking processes of
LLMs.
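A sketch of the Latent Reward Model idea: a classifier over latent-thought vectors predicts whether the eventual answer is correct, and its scores can then steer latent thinking. The vectors below are synthetic stand-ins, not activations from the model studied in the paper.

```python
# Sketch of a Latent Reward Model: a classifier on latent-thought vectors
# predicts answer correctness. The latent vectors here are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 32, 2000
# Pretend "correct" and "incorrect" latent thoughts occupy shifted distributions.
correct = rng.normal(loc=0.3, size=(n // 2, d))
incorrect = rng.normal(loc=-0.3, size=(n // 2, d))
X = np.vstack([correct, incorrect])
y = np.array([1] * (n // 2) + [0] * (n // 2))

lrm = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy:", lrm.score(X, y))

# At inference, the probe scores candidate latent trajectories; higher-scoring
# latent thinking paths can be preferred (the optimization step).
candidate = rng.normal(loc=0.3, size=(1, d))
print("P(correct):", lrm.predict_proba(candidate)[0, 1])
```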
☆ One-Token Rollout: Guiding Supervised Fine-Tuning of LLMs with Policy Gradient
Supervised fine-tuning (SFT) is the predominant method for adapting large
language models (LLMs), yet it often struggles with generalization compared to
reinforcement learning (RL). In this work, we posit that this performance
disparity stems not just from the loss function, but from a more fundamental
difference: SFT learns from a fixed, pre-collected dataset, whereas RL utilizes
on-policy data sampled from the current policy. Building on this hypothesis, we
introduce one-token rollout (OTR), a novel fine-tuning algorithm that guides
SFT with the policy gradient method. OTR reframes the autoregressive learning
process by treating each token generation as a single-step reinforcement
learning trajectory. At each step, it performs a Monte Carlo "rollout" by
sampling multiple candidate tokens from the current policy's distribution. The
ground-truth token from the supervised data is then used to provide a reward
signal to these samples. Guided by policy gradient, our algorithm repurposes
static, off-policy supervised data into a dynamic, on-policy signal at the
token level, capturing the generalization benefits of on-policy learning while
bypassing the costly overhead of full sentence generation. Through extensive
experiments on a diverse suite of challenging benchmarks spanning mathematical
reasoning, code generation, and general domain reasoning, we demonstrate that
OTR consistently outperforms standard SFT. Our findings establish OTR as a
powerful and practical alternative for fine-tuning LLMs and provide compelling
evidence that the on-policy nature of data is a critical driver of
generalization, offering a promising new direction for fine-tuning LLMs.
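A sketch of a one-token-rollout-style update: at each position, candidate tokens are sampled from the current policy, samples matching the ground-truth token are rewarded, and a REINFORCE-style loss is applied. The reward values and baseline are illustrative assumptions, not the paper's exact objective.

```python
# Sketch of a token-level policy-gradient update guided by supervised targets.
# Reward values and the per-position baseline are illustrative assumptions.
import torch
import torch.nn.functional as F

def otr_loss(logits, target_ids, num_samples=4):
    # logits: [seq, vocab]; target_ids: [seq]
    logp = F.log_softmax(logits, dim=-1)
    samples = torch.multinomial(logp.exp(), num_samples, replacement=True)  # [seq, k]
    rewards = (samples == target_ids.unsqueeze(-1)).float()                 # 1 if match
    advantages = rewards - rewards.mean(dim=-1, keepdim=True)               # per-position baseline
    sample_logp = logp.gather(-1, samples)
    return -(advantages.detach() * sample_logp).mean()

vocab, seq = 100, 6
logits = torch.randn(seq, vocab, requires_grad=True)
targets = torch.randint(0, vocab, (seq,))
loss = otr_loss(logits, targets)
loss.backward()
print(loss.item(), logits.grad.shape)
```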
☆ Feedback Forensics: A Toolkit to Measure AI Personality
Some traits making a "good" AI model are hard to describe upfront. For
example, should responses be more polite or more casual? Such traits are
sometimes summarized as model character or personality. Without a clear
objective, conventional benchmarks based on automatic validation struggle to
measure such traits. Evaluation methods using human feedback such as Chatbot
Arena have emerged as a popular alternative. These methods infer "better"
personality and other desirable traits implicitly by ranking multiple model
responses relative to each other. Recent issues with model releases highlight
limitations of these existing opaque evaluation approaches: a major model was
rolled back over sycophantic personality issues, and models were observed
overfitting to such feedback-based leaderboards. Despite these known issues,
limited public tooling exists to explicitly evaluate model personality. We
introduce Feedback Forensics: an open-source toolkit to track AI personality
changes, both those encouraged by human (or AI) feedback, and those exhibited
across AI models trained and evaluated on such feedback. Leveraging AI
annotators, our toolkit enables investigating personality via Python API and
browser app. We demonstrate the toolkit's usefulness in two steps: (A) first we
analyse the personality traits encouraged in popular human feedback datasets
including Chatbot Arena, MultiPref and PRISM; and (B) then use our toolkit to
analyse how much popular models exhibit such traits. We release (1) our
Feedback Forensics toolkit alongside (2) a web app tracking AI personality in
popular models and feedback datasets as well as (3) the underlying annotation
data at https://github.com/rdnfn/feedback-forensics.
☆ QUARTZ : QA-based Unsupervised Abstractive Refinement for Task-oriented Dialogue Summarization EMNLP 2025
Dialogue summarization aims to distill the core meaning of a conversation
into a concise text. This is crucial for reducing the complexity and noise
inherent in dialogue-heavy applications. While recent approaches typically
train language models to mimic human-written summaries, such supervision is
costly and often results in outputs that lack task-specific focus, limiting
their effectiveness in downstream applications, such as medical tasks. In this
paper, we propose QUARTZ, a framework for task-oriented, utility-based dialogue
summarization. QUARTZ starts by generating multiple summaries and task-oriented
question-answer pairs from a dialogue in a zero-shot manner using a pool of
large language models (LLMs). The quality of the generated summaries is
evaluated by having LLMs answer task-related questions before (i) selecting the
best candidate answers and (ii) identifying the most informative summary based
on these answers. Finally, we fine-tune the best LLM on the selected summaries.
When validated on multiple datasets, QUARTZ
demonstrates its effectiveness by achieving competitive results in various
zero-shot settings, rivaling fully-supervised State-of-the-Art (SotA) methods.
comment: Accepted to Empirical Methods in Natural Language Processing (EMNLP
2025)
☆ ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation
Existing approaches to skill proficiency estimation often rely on black-box
video classifiers, ignoring multi-view context and lacking explainability. We
present ProfVLM, a compact vision-language model that reformulates this task as
generative reasoning: it jointly predicts skill level and generates expert-like
feedback from egocentric and exocentric videos. Central to our method is an
AttentiveGatedProjector that dynamically fuses multi-view features, projected
from a frozen TimeSformer backbone into a language model tuned for feedback
generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses
state-of-the-art methods while using up to 20x fewer parameters and reducing
training time by up to 60%. Our approach not only achieves superior accuracy
across diverse activities, but also outputs natural language critiques aligned
with performance, offering transparent reasoning. These results highlight
generative vision-language modeling as a powerful new direction for skill
assessment.
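A sketch of gated multi-view fusion in the spirit of the projector described above: ego- and exo-view features are projected into the language model's embedding space and mixed by a learned gate. Dimensions and the gating form are assumptions; the actual AttentiveGatedProjector may differ.

```python
# Sketch of gated multi-view fusion: ego- and exo-view features are projected
# into the LM embedding space and mixed with a learned gate. Dimensions and
# gating form are assumptions.
import torch
import torch.nn as nn

class GatedViewFusion(nn.Module):
    def __init__(self, vis_dim=768, lm_dim=1024):
        super().__init__()
        self.proj_ego = nn.Linear(vis_dim, lm_dim)
        self.proj_exo = nn.Linear(vis_dim, lm_dim)
        self.gate = nn.Sequential(nn.Linear(2 * lm_dim, lm_dim), nn.Sigmoid())

    def forward(self, ego_feats, exo_feats):
        ego = self.proj_ego(ego_feats)           # [batch, tokens, lm_dim]
        exo = self.proj_exo(exo_feats)
        g = self.gate(torch.cat([ego, exo], dim=-1))
        return g * ego + (1 - g) * exo           # view-adaptive mixture

fusion = GatedViewFusion()
ego = torch.randn(2, 8, 768)   # e.g. frozen video backbone features per view
exo = torch.randn(2, 8, 768)
print(fusion(ego, exo).shape)  # torch.Size([2, 8, 1024])
```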
☆ Optimizing Speech Language Models for Acoustic Consistency
We study speech language models that incorporate semantic initialization and
planning losses to achieve robust and consistent generation. Our approach
initializes speech tokens with self-supervised features, applies a light
alignment loss, and trains with thinning and auxiliary objectives that target
robustness and content planning. We train three models: a 0.7B speech-only
model, a 1.0B speech-only model, and a 1.0B interleaved model with both text
and speech. Acoustic studies show that the speech-only models achieve the
highest consistency across speaker, gender, sentiment, room, and background
factors, surpassing larger systems. Interleaving improves lexical and syntactic
probes and semantic--acoustic alignment but reduces consistency. Linear probes
show that our initialization biases the model toward content structure while
trading off prosody detail. These results show that LM-side design and training
mix control the balance between acoustic stability and semantic grounding
without changes to the tokenizer or runtime architecture. A demo and model
weights are available for exploration.
☆ Finetune Once: Decoupling General & Domain Learning with Dynamic Boosted Annealing
Fine-tuning large language models (LLMs) delivers strong gains in practice.
However, vanilla fine-tuning methods often require intricate data mixtures and
repeated experiments to reach optimal generalization. To address these challenges
and streamline the training process, we propose an efficient and universal
solution, Dynamic Boosted Annealing (DBA). We obtain a global gradient through
zero-learning-rate training on general data, which is subsequently employed for
gradient boosting and dynamic training step correction during domain training.
In conjunction with annealing learning, we end up establishing a fine-tuning
pipeline that relies solely on domain data without collapse. By evaluating both
general and domain-specific performance across multiple tasks on several
popular base models, DBA achieves an average improvement of 5.8% in joint
performance over vanilla fine-tuning. Furthermore, since general data is no
longer involved in annealing, the repeated experiments driven by data-mixture
tuning are also eliminated. According to our tests, DBA reduces GPU hours by
91.0% compared to the vanilla method.
comment: 9 pages, 5 figures
☆ Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners
Reinforcement Learning with Verifiable Reward (RLVR) effectively solves
complex tasks but demands extremely long context lengths during training,
leading to substantial computational costs. While multi-stage training can
partially mitigate this, starting with overly short contexts often causes
irreversible performance degradation, ultimately failing to reduce overall
training compute significantly. In this paper, we introduce
**T**hinking-**F**ree **P**olicy **I**nitialization (**TFPI**), a simple yet
effective adaptation to RLVR that bridges long Chain-of-Thought (CoT)
distillation and standard RLVR. TFPI employs a simple *ThinkFree* operation,
explicitly discarding the thinking content by directly appending the
end-of-thinking delimiter, to reduce token usage during inference. Training with
*ThinkFree*-adapted inputs
improves performance and lowers token consumption, even in the original
slow-thinking mode. Extensive experiments across various benchmarks have shown
that TFPI accelerates RL convergence, achieves a higher performance ceiling,
and yields more token-efficient reasoning models without specialized rewards or
complex training designs. With TFPI only, we train a 4B model to reach 89.0%
accuracy on AIME24 and 65.5% on LiveCodeBench using less than 4K H20 hours.
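A minimal sketch of the *ThinkFree* input adaptation described above, assuming the distilled model wraps its reasoning in an explicit thinking block whose closing delimiter is </think> (the exact tag is model-specific and an assumption here):

    THINK_END = "</think>"  # assumed end-of-thinking delimiter; model-specific

    def thinkfree_input(question: str) -> str:
        """Close the thinking block immediately so the model answers directly,
        skipping the long chain-of-thought (the core idea of the ThinkFree operation)."""
        return f"{question}\n<think>\n{THINK_END}\n"

    def strip_thinking(response: str) -> str:
        """Drop any thinking content that still precedes the delimiter."""
        return response.split(THINK_END, 1)[-1].strip()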
☆ Type-Less yet Type-Aware Inductive Link Prediction with Pretrained Language Models EMNLP 2025
Alessandro De Bellis, Salvatore Bufi, Giovanni Servedio, Vito Walter Anelli, Tommaso Di Noia, Eugenio Di Sciascio
Inductive link prediction is emerging as a key paradigm for real-world
knowledge graphs (KGs), where new entities frequently appear and models must
generalize to them without retraining. Predicting links in a KG faces the
challenge of guessing previously unseen entities by leveraging generalizable
node features such as subgraph structure, type annotations, and ontological
constraints. However, explicit type information is often lacking or incomplete.
Even when available, type information in most KGs is often coarse-grained,
sparse, and prone to errors due to human annotation. In this work, we explore
the potential of pre-trained language models (PLMs) to enrich node
representations with implicit type signals. We introduce TyleR, a Type-less yet
type-awaRe approach for subgraph-based inductive link prediction that leverages
PLMs for semantic enrichment. Experiments on standard benchmarks demonstrate
that TyleR outperforms state-of-the-art baselines in scenarios with scarce type
annotations and sparse graph connectivity. To ensure reproducibility, we share
our code at https://github.com/sisinflab/tyler.
comment: Accepted and to appear in Proceedings of the 2025 Conference on
Empirical Methods in Natural Language Processing (EMNLP 2025)
☆ Comparative Analysis of Ant Colony Optimization and Google OR-Tools for Solving the Open Capacitated Vehicle Routing Problem in Logistics
In modern logistics management systems, route planning requires high
efficiency. The Open Capacitated Vehicle Routing Problem (OCVRP) deals with
finding optimal delivery routes for a fleet of vehicles serving geographically
distributed customers, without requiring the vehicles to return to the depot
after deliveries. This study compares two algorithms for solving the OCVRP:
Ant Colony Optimization (ACO), a nature-inspired metaheuristic, and Google
OR-Tools, an industry-standard optimization toolkit. Both were implemented in
Python and evaluated on a custom dataset. Performance was assessed in terms of
routing efficiency, computation time, and scalability. The results show that
ACO offers greater flexibility in routing parameters, while OR-Tools runs much
faster, is more consistent, and requires less input. These findings can inform
the choice of routing strategy for scalable, real-time logistics systems.
comment: 6 pages, accepted at Intelligent Methods, Systems, and Applications
(IMSA 2025)
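For readers unfamiliar with how ACO handles the open-route variant, the toy sketch below constructs capacity-feasible routes that start at the depot (node 0) and end at their last customer, with pheromone reinforcement on the best solution found. Parameters, the solution encoding, and the update rule are illustrative assumptions rather than the study's implementation.

    import random

    def aco_ocvrp(dist, demand, capacity, n_ants=20, n_iter=100,
                  alpha=1.0, beta=2.0, rho=0.1, seed=0):
        """Toy Ant Colony Optimization for the open CVRP: node 0 is the depot,
        routes end at their last customer (no return leg). Illustrative only."""
        assert all(d <= capacity for d in demand[1:]), "each customer must fit in one vehicle"
        rng = random.Random(seed)
        n = len(dist)
        tau = [[1.0] * n for _ in range(n)]                      # pheromone matrix
        best_cost, best_routes = float("inf"), None

        def route_cost(routes):                                  # open routes: no depot return
            return sum(dist[a][b] for r in routes for a, b in zip([0] + r, r))

        for _ in range(n_iter):
            for _ in range(n_ants):
                unvisited, routes = set(range(1, n)), []
                while unvisited:
                    route, load, cur = [], 0.0, 0
                    while True:
                        cands = [j for j in unvisited if load + demand[j] <= capacity]
                        if not cands:
                            break                                # vehicle full: open a new route
                        w = [(tau[cur][j] ** alpha) * (1.0 / (dist[cur][j] + 1e-9)) ** beta
                             for j in cands]
                        nxt = rng.choices(cands, weights=w, k=1)[0]
                        route.append(nxt); load += demand[nxt]
                        unvisited.discard(nxt); cur = nxt
                    routes.append(route)
                cost = route_cost(routes)
                if cost < best_cost:
                    best_cost, best_routes = cost, routes
            tau = [[(1 - rho) * t for t in row] for row in tau]  # evaporation
            for r in best_routes:                                # reinforce best-so-far edges
                for a, b in zip([0] + r, r):
                    tau[a][b] += 1.0 / best_cost
        return best_cost, best_routes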
☆ VietBinoculars: A Zero-Shot Approach for Detecting Vietnamese LLM-Generated Text
The rapid development of transformer-based Large Language Models (LLMs) raises
key challenges, one of which is distinguishing between human-written and
LLM-generated text. As LLM-generated content becomes increasingly sophisticated
and human-like, traditional detection methods are proving less effective,
especially as the number and diversity of LLMs continue to grow, with new models
and versions released at a rapid pace. This study proposes
VietBinoculars, an adaptation of the Binoculars method with optimized global
thresholds, to enhance the detection of Vietnamese LLM-generated text. We have
constructed new Vietnamese AI-generated datasets to determine the optimal
thresholds for VietBinoculars and to enable benchmarking. Our experimental
results show that VietBinoculars achieves over 99\% accuracy, F1-score, and AUC
in both domains on multiple out-of-domain datasets. It
outperforms the original Binoculars model, traditional detection methods, and
other state-of-the-art approaches, including commercial tools such as ZeroGPT
and DetectGPT, especially under specially modified prompting strategies.
comment: 27 pages
☆ Auto-ARGUE: LLM-Based Report Generation Evaluation ECIR 2025
William Walden, Marc Mason, Orion Weller, Laura Dietz, Hannah Recknor, Bryan Li, Gabrielle Kaili-May Liu, Yu Hou, James Mayfield, Eugene Yang
Generation of long-form, citation-backed reports is a primary use case for
retrieval augmented generation (RAG) systems. While open-source evaluation
tools exist for various RAG tasks, ones tailored to report generation are
lacking. Accordingly, we introduce Auto-ARGUE, a robust LLM-based
implementation of the recent ARGUE framework for report generation evaluation.
We present analysis of Auto-ARGUE on the report generation pilot task from the
TREC 2024 NeuCLIR track, showing good system-level correlations with human
judgments. We further release a web app for visualization of Auto-ARGUE
outputs.
comment: ECIR 2025 demo format
☆ Explaining novel senses using definition generation with open language models
We apply definition generators based on open-weights large language models to
the task of creating explanations of novel senses, taking target word usages as
an input. To this end, we employ the datasets from the AXOLOTL'24 shared task
on explainable semantic change modeling, which features Finnish, Russian and
German languages. We fine-tune open-source models and release them publicly;
they outperform the best submissions to the aforementioned shared task, which
employed closed proprietary LLMs. In addition, we find that
encoder-decoder definition generators perform on par with their decoder-only
counterparts.
☆ MGen: Millions of Naturally Occurring Generics in Context SC
MGen is a dataset of over 4 million naturally occurring generic and
quantified sentences extracted from diverse textual sources. Sentences in the
dataset have long context documents, corresponding to websites and academic
papers, and cover 11 different quantifiers. We analyze the features of generic
sentences in the dataset, with interesting insights: generics can be long
sentences (averaging over 16 words), and speakers often use them to express
generalisations about people.
MGen is the biggest and most diverse dataset of naturally occurring generic
sentences, opening the door to large-scale computational research on
genericity. It is publicly available at https://gustavocilleruelo.com/mgen
comment: Presented at SCiL 2025
☆ CliniBench: A Clinical Outcome Prediction Benchmark for Generative and Encoder-Based Language Models
With their growing capabilities, generative large language models (LLMs) are
being increasingly investigated for complex medical tasks. However, their
effectiveness in real-world clinical applications remains underexplored. To
address this, we present CliniBench, the first benchmark that enables
comparability of well-studied encoder-based classifiers and generative LLMs for
discharge diagnosis prediction from admission notes in the MIMIC-IV dataset. Our
extensive study compares 12 generative LLMs and 3 encoder-based classifiers and
demonstrates that encoder-based classifiers consistently outperform generative
models in diagnosis prediction. We assess several retrieval augmentation
strategies for in-context learning from similar patients and find that they
provide notable performance improvements for generative LLMs.
☆ The Hunger Game Debate: On the Emergence of Over-Competition in Multi-Agent Systems
Xinbei Ma, Ruotian Ma, Xingyu Chen, Zhengliang Shi, Mengru Wang, Jen-tse Huang, Qu Yang, Wenxuan Wang, Fanghua Ye, Qingxuan Jiang, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Hai Zhao, Zhaopeng Tu, Xiaolong Li, Linus
LLM-based multi-agent systems demonstrate great potential for tackling
complex problems, but how competition shapes their behavior remains
underexplored. This paper investigates over-competition in multi-agent
debate, where agents under extreme pressure exhibit unreliable, harmful
behaviors that undermine both collaboration and task performance. To study this
phenomenon, we propose HATE, the Hunger Game Debate, a novel experimental
framework that simulates debates under a zero-sum competition arena. Our
experiments, conducted across a range of LLMs and tasks, reveal that
competitive pressure significantly stimulates over-competition behaviors and
degrades task performance, causing discussions to derail. We further explore
the impact of environmental feedback by adding variants of judges, indicating
that objective, task-focused feedback effectively mitigates the
over-competition behaviors. We also probe the post-hoc kindness of LLMs and
compile a leaderboard characterizing top LLMs, providing insights for
understanding and governing the emergent social dynamics of the AI community.
☆ Vocabulary Customization for Efficient Domain-Specific LLM Deployment
When using an LLM to process text outside the training domain(s), an often
overlooked factor is vocabulary mismatch, where the general-domain tokenizer
fails to capture frequent domain-specific terms, leading to higher token
fertility and thus a decrease in processing speed due to suboptimal sub-word
splits.
We address this limitation by augmenting the pretrained vocabulary with a set
of domain-specific tokens. To this end, we design an algorithm that extends an
existing tokenizer while guaranteeing it never decreases tokenization
efficiency: every input sequence is segmented into at most the same number of
tokens as before.
Evaluated on real-world e-Commerce use-cases, the augmented tokenizer
significantly shortens input sequences by up to 20% and reduces inference
latency on downstream tasks while preserving predictive quality. We further
analyze secondary effects, such as the impact on forward pass speed and the
rate at which the model adopts the newly introduced tokens, to illustrate the
broader benefits of vocabulary adaptation.
comment: Accepted at NEURIPS 2025 CCFM Workshop
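One simple way to realize the "at most the same number of tokens" guarantee is to admit a candidate domain token only if longest-match pre-splitting with it never lengthens the segmentation of a held-out sample. The sketch below illustrates that idea only; base_tokenize stands in for the pretrained tokenizer, and the paper's actual construction may differ.

    from typing import Callable, Iterable, List, Sequence

    Tokenize = Callable[[str], List[str]]  # any base tokenizer: text -> list of tokens

    def segment(text: str, domain_tokens: Sequence[str], base_tokenize: Tokenize) -> List[str]:
        """Longest-match pre-split on domain tokens, falling back to the base tokenizer."""
        if not text:
            return []
        for tok in sorted(domain_tokens, key=len, reverse=True):
            idx = text.find(tok)
            if idx != -1:
                left, right = text[:idx], text[idx + len(tok):]
                return (segment(left, domain_tokens, base_tokenize)
                        + [tok]
                        + segment(right, domain_tokens, base_tokenize))
        return base_tokenize(text)

    def extend_vocab(base_tokenize: Tokenize,
                     candidates: Iterable[str],
                     sample_corpus: Sequence[str]) -> List[str]:
        """Accept a candidate domain token only if it never lengthens the tokenization
        of any sample text -- one way to enforce a 'never worse' guarantee; the
        paper's exact algorithm may differ."""
        accepted: List[str] = []
        for cand in sorted(set(candidates), key=len, reverse=True):
            trial = accepted + [cand]
            if all(len(segment(t, trial, base_tokenize)) <= len(segment(t, accepted, base_tokenize))
                   for t in sample_corpus):
                accepted.append(cand)
        return accepted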
☆ End-to-End Aspect-Guided Review Summarization at Scale EMNLP 2025
Ilya Boytsov, Vinny DeGenova, Mikhail Balyasin, Joseph Walt, Caitlin Eusden, Marie-Claire Rochat, Margaret Pierson
We present a scalable large language model (LLM)-based system that combines
aspect-based sentiment analysis (ABSA) with guided summarization to generate
concise and interpretable product review summaries for the Wayfair platform.
Our approach first extracts and consolidates aspect-sentiment pairs from
individual reviews, selects the most frequent aspects for each product, and
samples representative reviews accordingly. These are used to construct
structured prompts that guide the LLM to produce summaries grounded in actual
customer feedback. We demonstrate the real-world effectiveness of our system
through a large-scale online A/B test. Furthermore, we describe our real-time
deployment strategy and release a dataset of 11.8 million anonymized customer
reviews covering 92,000 products, including extracted aspects and generated
summaries, to support future research in aspect-guided review summarization.
comment: Camera-ready preprint for EMNLP 2025 Industry Track
☆ Reinforced Strategy Optimization for Conversational Recommender Systems via Network-of-Experts
Conversational Recommender Systems (CRSs) provide personalized
recommendations through multi-turn interactions. With the strong reasoning
abilities of Large Language Models (LLMs), applying them to CRSs has become
promising. Yet, existing methods often lack explicit optimization of
interaction strategies, relying instead on unified prompts, which can yield
suboptimal outcomes. We propose Reinforced Strategy Optimization (RSO), a
hierarchical framework that decomposes response generation into macro-level
strategy planning and micro-level adaptation within a network-of-experts. A
Planner selects strategies (e.g., recommend, explain, encourage), while an
Actor generates responses guided by auxiliary experts for preferences and
factual grounding. This disentanglement enables more tractable learning. To
address limited multi-turn data, we model strategy learning as reinforcement
learning with an LLM-based reward for exploration. Experiments show RSO
outperforms state-of-the-art baselines, validating the effectiveness of
hierarchical strategy optimization.
☆ IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation
Johannes Schmitt, Gergely Bérczi, Jasper Dekoninck, Jeremy Feusi, Tim Gehrunger, Raphael Appenzeller, Jim Bryan, Niklas Canova, Timo de Wolff, Filippo Gaia, Michel van Garrel, Baran Hashemi, David Holmes, Aitor Iribar Lopez, Victor Jaeck, Martina Jørgensen, Steven Kelk, Stefan Kuhlmann, Adam Kurpisz, Chiara Meroni, Ingmar Metzler, Martin Möller, Samuel Muñoz-Echániz, Robert Nowak, Georg Oberdieck, Daniel Platt, Dylan Possamaï, Gabriel Ribeiro, Raúl Sánchez Galán, Zheming Sun, Josef Teichmann, Richard P. Thomas, Charles Vial
As the mathematical capabilities of large language models (LLMs) improve, it
becomes increasingly important to evaluate their performance on research-level
tasks at the frontier of mathematical knowledge. However, existing benchmarks
are limited, as they focus solely on final-answer questions or high-school
competition problems. To address this gap, we introduce IMProofBench, a private
benchmark consisting of 39 peer-reviewed problems developed by expert
mathematicians. Each problem requires a detailed proof and is paired with
subproblems that have final answers, supporting both an evaluation of
mathematical reasoning capabilities by human experts and a large-scale
quantitative analysis through automated grading. Furthermore, unlike prior
benchmarks, the evaluation setup simulates a realistic research environment:
models operate in an agentic framework with tools like web search for
literature review and mathematical software such as SageMath. Our results show
that current LLMs can succeed at the more accessible research-level questions,
but still encounter significant difficulties on more challenging problems.
Quantitatively, Grok-4 achieves the highest accuracy of 52% on final-answer
subproblems, while GPT-5 obtains the best performance for proof generation,
achieving a fully correct solution for 22% of problems. IMProofBench will
continue to evolve as a dynamic benchmark in collaboration with the
mathematical community, ensuring its relevance for evaluating the next
generation of LLMs.
☆ Limited Preference Data? Learning Better Reward Model with Latent Space Synthesis NeurIPS 2025
Reward modeling, crucial for aligning large language models (LLMs) with human
preferences, is often bottlenecked by the high cost of preference data.
Existing textual data synthesis methods are computationally expensive. We
propose LENS, a novel framework for synthesizing preference data directly in the
LLM's latent embedding space. Our method employs a Variational Autoencoder
(VAE) to learn a structured latent representation of response embeddings. By
performing controlled perturbations in this latent space and decoding back to
the embedding space, we efficiently generate diverse, semantically consistent
synthetic preference pairs, bypassing costly text generation and annotation. We
provide theoretical guarantees that our synthesized pairs approximately
preserve original preference ordering and improve reward model generalization.
Empirically, our latent-space synthesis significantly outperforms text-based
augmentation on standard benchmarks, achieving superior results while being 18x
faster in generation and using a 16,000x smaller model. Our work offers a
scalable and effective alternative for enhancing reward modeling through
efficient data augmentation. Code is publicly available at
https://github.com/deeplearning-wisc/lens
comment: Accepted by NeurIPS 2025
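A bare-bones sketch of the encode-perturb-decode loop over response embeddings is given below. The network sizes, the Gaussian perturbation, and the labelling of the perturbed output as the rejected side are assumptions made for illustration, not the paper's exact scheme.

    import torch
    import torch.nn as nn

    class EmbeddingVAE(nn.Module):
        """Tiny VAE over response embeddings (dimensions are placeholders)."""
        def __init__(self, emb_dim=768, latent_dim=64):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(emb_dim, 256), nn.ReLU())
            self.mu = nn.Linear(256, latent_dim)
            self.logvar = nn.Linear(256, latent_dim)
            self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))

        def encode(self, x):
            h = self.enc(x)
            return self.mu(h), self.logvar(h)

        def decode(self, z):
            return self.dec(z)

    @torch.no_grad()
    def synthesize_pair(vae: EmbeddingVAE, chosen_emb: torch.Tensor, noise_scale: float = 0.1):
        """Perturb the latent of a preferred response embedding to obtain a nearby
        'rejected-like' embedding: a sketch of latent-space synthesis, skipping the
        reparameterization sampling for determinism."""
        mu, _ = vae.encode(chosen_emb)
        z_pos = mu
        z_neg = mu + noise_scale * torch.randn_like(mu)   # controlled perturbation
        return vae.decode(z_pos), vae.decode(z_neg)       # synthetic (chosen, rejected) embeddings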
☆ The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge
Large language models (LLMs) are increasingly deployed as automatic judges to
evaluate system outputs in tasks such as summarization, dialogue, and creative
writing. A faithful judge should base its verdicts solely on response quality
and explicitly acknowledge the factors shaping its decision. We show that
current LLM judges fail on both counts by relying on shortcuts introduced in
the prompt. Our study uses two evaluation datasets: ELI5, a benchmark for
long-form question answering, and LitBench, a recent benchmark for creative
writing. Both datasets provide pairwise comparisons, where the evaluator must
choose which of two responses is better. From each dataset we construct 100
pairwise judgment tasks and employ two widely used models, GPT-4o and
Gemini-2.5-Flash, as evaluators in the role of LLM-as-a-judge. For each pair,
we assign superficial cues to the responses, provenance cues indicating source
identity (Human, Expert, LLM, or Unknown) and recency cues indicating temporal
origin (Old, 1950 vs. New, 2025), while keeping the rest of the prompt fixed.
Results reveal consistent verdict shifts: both models exhibit a strong recency
bias, systematically favoring new responses over old, as well as a clear
provenance hierarchy (Expert > Human > LLM > Unknown). These biases are
especially pronounced in GPT-4o and in the more subjective and open-ended
LitBench domain. Crucially, cue acknowledgment is rare: justifications almost
never reference the injected cues, instead rationalizing decisions in terms of
content qualities. These findings demonstrate that current LLM-as-a-judge
systems are shortcut-prone and unfaithful, undermining their reliability as
evaluators in both research and deployment.
comment: 9 pages, 5 tables, 1 figure
☆ DyFlow: Dynamic Workflow Framework for Agentic Reasoning
Yanbo Wang, Zixiang Xu, Yue Huang, Xiangqi Wang, Zirui Song, Lang Gao, Chenxi Wang, Xiangru Tang, Yue Zhao, Arman Cohan, Xiangliang Zhang, Xiuying Chen
Agent systems based on large language models (LLMs) have shown great
potential in complex reasoning tasks, but building efficient and generalizable
workflows remains a major challenge. Most existing approaches rely on manually
designed processes, which limits their adaptability across different tasks.
While a few methods attempt automated workflow generation, they are often tied
to specific datasets or query types and make limited use of intermediate
feedback, reducing system robustness and reasoning depth. Moreover, their
operations are typically predefined and inflexible. To address these
limitations, we propose DyFlow, a dynamic workflow generation framework that
adaptively constructs and adjusts reasoning procedures based on task
requirements and real-time intermediate feedback, thereby enhancing cross-task
generalization. DyFlow consists of two core components: a designer and an
executor. The designer decomposes complex problems into a sequence of sub-goals
defined by high-level objectives and dynamically plans the next steps based on
intermediate outputs and feedback. These plans are then carried out by the
executor, which executes each operation using dynamic operators with
context-aware parameterization, enabling flexible and semantically grounded
reasoning. We systematically evaluate DyFlow across diverse domains, including
social reasoning, biomedical tasks, mathematical problem solving, and code
generation. Results demonstrate that DyFlow significantly outperforms existing
baselines, achieving substantial Pass@k improvements and exhibiting robust
generalization across diverse domains. The code is publicly available at
https://github.com/wyf23187/DyFlow.
☆ CEAID: Benchmark of Multilingual Machine-Generated Text Detection Methods for Central European Languages
Research on machine-generated text detection, an important task, has
predominantly focused on English. This leaves existing detectors almost unusable
for non-English languages, for which they must rely purely on cross-lingual
transferability. Only a few works address any of the Central European languages,
leaving transferability towards these languages largely unexplored. We fill this
gap by providing the first benchmark of detection methods focused on this
region, together with a comparison of training-language combinations to identify
the best-performing ones. We focus on multi-domain, multi-generator, and
multilingual evaluation, pinpointing the differences across these individual
aspects as well as the adversarial robustness of detection methods. Detectors
supervised-finetuned on the Central European languages prove the most performant
in these languages and the most resistant to obfuscation.
☆ RE-Searcher: Robust Agentic Search with Goal-oriented Planning and Self-reflection
Daocheng Fu, Jianbiao Mei, Licheng Wen, Xuemeng Yang, Cheng Yang, Rong Wu, Tao Hu, Siqi Li, Yufan Shen, Xinyu Cai, Pinlong Cai, Botian Shi, Yong Liu, Yu Qiao
Large language models (LLMs) excel at knowledge-intensive question answering
and reasoning, yet their real-world deployment remains constrained by knowledge
cutoff, hallucination, and limited interaction modalities. Augmenting LLMs with
external search tools helps alleviate these issues, but it also exposes agents
to a complex search environment in which small, plausible variations in query
formulation can steer reasoning into unproductive trajectories and amplify
errors. We present a systematic analysis that quantifies how environmental
complexity induces fragile search behaviors and, in turn, degrades overall
performance. To address this challenge, we propose a simple yet effective
approach to instantiate a search agent, RE-Searcher. During search, RE-Searcher
explicitly articulates a concrete search goal and subsequently reflects on
whether the retrieved evidence satisfies that goal. This combination of
goal-oriented planning and self-reflection enables RE-Searcher to resist
spurious cues in complex search environments and perform robust search.
Extensive experiments show that our method improves search accuracy and
achieves state-of-the-art results. Perturbation studies further demonstrate
substantial resilience to noisy or misleading external signals, mitigating the
fragility of the search process. We believe these findings offer practical
guidance for integrating LLM-powered agents into more complex interactive
environments and enabling more autonomous decision-making.
comment: 16 pages, 7 figures
☆ Scaling Up Temporal Domain Generalization via Temporal Experts Averaging EMNLP 2025
Aoming Liu, Kevin Miller, Venkatesh Saligrama, Kate Saenko, Boqing Gong, Ser-Nam Lim, Bryan A. Plummer
Temporal Domain Generalization (TDG) aims to generalize across temporal
distribution shifts, e.g., lexical change over time. Prior work often addresses
this by predicting future model weights. However, full model prediction is
prohibitively expensive for even reasonably sized models. Thus, recent methods
only predict the classifier layer, limiting generalization by failing to adjust
other model components. To address this, we propose Temporal Experts Averaging
(TEA), a novel and scalable TDG framework that updates the entire model using
weight averaging to maximize generalization potential while minimizing
computational costs. Our theoretical analysis guides us to two steps that
enhance generalization to future domains. First, we create expert models with
functional diversity yet parameter similarity by fine-tuning a domain-agnostic
base model on individual temporal domains while constraining weight changes.
Second, we optimize the bias-variance tradeoff through adaptive averaging
coefficients derived from modeling temporal weight trajectories in a principal
component subspace. Experts' contributions are based on their projected
proximity to future domains. Extensive experiments across 7 TDG benchmarks, 5
models, and 2 TDG settings show that TEA outperforms prior TDG methods by up to
69% while being up to 60x more efficient.
comment: Accepted by EMNLP 2025 main
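The averaging step can be pictured as follows: project the time-ordered expert weights onto a few principal components, extrapolate the trajectory one domain ahead, and weight each expert by its proximity to that predicted point. The numpy sketch below follows that picture under simplifying assumptions; the paper's bias-variance-derived coefficients are more involved.

    import numpy as np

    def tea_average(expert_weights, k=2, temperature=1.0):
        """Average time-ordered expert weight vectors, weighting each expert by its
        proximity, in a principal-component subspace, to a linearly extrapolated
        'future' point. Rough sketch only.

        expert_weights: list of 1-D numpy arrays (flattened model weights), oldest first.
        """
        W = np.stack(expert_weights)                      # (T, D)
        mean = W.mean(axis=0)
        U, S, Vt = np.linalg.svd(W - mean, full_matrices=False)
        P = (W - mean) @ Vt[:k].T                         # (T, k) projected trajectory

        # Fit a linear trend over time in the subspace and extrapolate one step ahead.
        t = np.arange(len(P))
        slope, intercept = np.polyfit(t, P, deg=1)        # each of shape (k,)
        future = slope * len(P) + intercept               # predicted next-domain projection

        d = np.linalg.norm(P - future, axis=1)            # distance of each expert to that point
        alpha = np.exp(-d / temperature)
        alpha /= alpha.sum()                              # adaptive averaging coefficients
        return (alpha[:, None] * W).sum(axis=0)           # averaged weights for deployment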
☆ Unspoken Hints: Accuracy Without Acknowledgement in LLM Reasoning
Arash Marioriyad, Shaygan Adim, Nima Alighardashi, Mahdieh Soleymani Banghshah, Mohammad Hossein Rohban
Large language models (LLMs) increasingly rely on chain-of-thought (CoT)
prompting to solve mathematical and logical reasoning tasks. Yet, a central
question remains: to what extent are these generated rationales \emph{faithful}
to the underlying computations, rather than post-hoc narratives shaped by hints
that function as answer shortcuts embedded in the prompt? Following prior work
on hinted vs.\ unhinted prompting, we present a systematic study of CoT
faithfulness under controlled hint manipulations. Our experimental design spans
four datasets (AIME, GSM-Hard, MATH-500, UniADILR), two state-of-the-art models
(GPT-4o and Gemini-2-Flash), and a structured set of hint conditions varying in
correctness (correct and incorrect), presentation style (sycophancy and data
leak), and complexity (raw answers, two-operator expressions, four-operator
expressions). We evaluate both task accuracy and whether hints are explicitly
acknowledged in the reasoning. Our results reveal three key findings. First,
correct hints substantially improve accuracy, especially on harder benchmarks
and logical reasoning, while incorrect hints sharply reduce accuracy in tasks
with lower baseline competence. Second, acknowledgement of hints is highly
uneven: equation-based hints are frequently referenced, whereas raw hints are
often adopted silently, indicating that more complex hints push models toward
verbalizing their reliance in the reasoning process. Third, presentation style
matters: sycophancy prompts encourage overt acknowledgement, while leak-style
prompts increase accuracy but promote hidden reliance. This may reflect
RLHF-related effects, as sycophancy exploits the human-pleasing side and data
leak triggers the self-censoring side. Together, these results demonstrate that
LLM reasoning is systematically shaped by shortcuts in ways that obscure
faithfulness.
comment: 5 Pages, 4 Figures, 4 Tables
☆ RE$^2$: Improving Chinese Grammatical Error Correction via Retrieving Appropriate Examples with Explanation
The primary objective of Chinese grammatical error correction (CGEC) is to
detect and correct errors in Chinese sentences. Recent research shows that
large language models (LLMs) have been applied to CGEC with significant
results. For LLMs, selecting appropriate reference examples can help improve
their performance. However, existing methods predominantly rely on text
similarity for example retrieval, a strategy that frequently mismatches actual
error patterns and retrieves lexically similar yet grammatically irrelevant
sentences. To address this problem, we propose a method named RE$^2$, which
retrieves appropriate examples with explanations of grammatical errors. Instead
of using text similarity of the input sentence, we use explanations of
grammatical errors to select reference examples, which are used by LLMs to
improve the performance of CGEC. We conduct experiments on two CGEC datasets
and create a high-quality grammatical error explanation (GEE) dataset, which is
not only used in our research but also serves as a valuable resource for future
studies in both CGEC and GEE. The experimental results on the two datasets
indicate that our proposed method effectively improves the performance of CGEC.
☆ FITS: Towards an AI-Driven Fashion Information Tool for Sustainability ECAI 2025
Access to credible sustainability information in the fashion industry remains
limited and challenging to interpret, despite growing public and regulatory
demands for transparency. General-purpose language models often lack
domain-specific knowledge and tend to "hallucinate", which is particularly
harmful for fields where factual correctness is crucial. This work explores how
Natural Language Processing (NLP) techniques can be applied to classify
sustainability data for fashion brands, thereby addressing the scarcity of
credible and accessible information in this domain. We present a prototype
Fashion Information Tool for Sustainability (FITS), a transformer-based system
that extracts and classifies sustainability information from credible,
unstructured text sources: NGO reports and scientific publications. Several
BERT-based language models, including models pretrained on scientific and
climate-specific data, are fine-tuned on our curated corpus using a
domain-specific classification schema, with hyperparameters optimized via
Bayesian optimization. FITS allows users to search for relevant data, analyze
their own data, and explore the information via an interactive interface. We
evaluated FITS in two focus groups of potential users concerning usability,
visual design, content clarity, possible use cases, and desired features. Our
results highlight the value of domain-adapted NLP in promoting informed
decision-making and emphasize the broader potential of AI applications in
addressing climate-related challenges. Finally, this work provides a valuable
dataset, the SustainableTextileCorpus, along with a methodology for future
updates. Code available at https://github.com/daphne12345/FITS
comment: accepted at ECAI 2025
☆ RAGferee: Building Contextual Reward Models for Retrieval-Augmented Generation EMNLP 2025
Andrei C. Coman, Ionut-Teodor Sorodoc, Leonardo F. R. Ribeiro, Bill Byrne, James Henderson, Adrià de Gispert
Existing Reward Models (RMs), typically trained on general preference data,
struggle in Retrieval Augmented Generation (RAG) settings, which require
judging responses for faithfulness to retrieved context, relevance to the user
query, appropriate refusals when context is insufficient, completeness and
conciseness of information. To address the lack of publicly available
RAG-centric preference datasets and specialised RMs, we introduce RAGferee, a
methodology that repurposes question-answering (QA) datasets into preference
pairs that prioritise groundedness over stylistic features, enabling the
training of contextual RMs better suited to judging RAG responses. Using
RAGferee, we curate a small preference dataset of 4K samples and fine-tune RMs
ranging from 7B to 24B parameters. Our RAG-centric RMs achieve state-of-the-art
performance on ContextualJudgeBench, surpassing existing 70B+ RMs trained on
much larger (up to 2.4M samples) general corpora, with an absolute improvement
of +15.5%.
comment: Accepted to EMNLP 2025 Main
☆ CAST: Continuous and Differentiable Semi-Structured Sparsity-Aware Training for Large Language Models
Sparsity-aware training is an effective approach for transforming large
language models (LLMs) into hardware-friendly sparse patterns, thereby reducing
latency and memory consumption during inference. In this paper, we propose
Continuous Adaptive Sparse Trainer (CAST), a fully continuous and
differentiable sparsity-aware training framework for semi-structured (or "N:M")
sparse models. Unlike previous approaches that optimize sparsity patterns and
weights separately, CAST enables seamless joint optimization during training,
while progressively transforming the model into the desired sparsity format.
Specifically, CAST introduces three key components: 1) AdamS, a sparsity-aware
optimizer that leverages adaptive L1 decay to promote uniform sparsification
across all parameters; 2) Weight Scaling, a module designed to mitigate the
magnitude reduction caused by decay while preserving desired sparsity patterns;
3) Knowledge Distillation, which employs the dense model as a self-teacher to
enhance training efficiency. We evaluate CAST under 2:4 sparsity patterns
across multiple model families, ranging from 125M to 13B parameters. Our
results demonstrate significant improvements over previous state-of-the-art
methods in both perplexity and zero-shot accuracy with minimal training
resources. Notably, on LLaMA2-7B, our 2:4 sparse model achieves a negligible
perplexity increase of 0.09 and a 0.36% gain in zero-shot accuracy compared to
the dense model using only 2% of the original pretraining tokens. Additionally,
we establish an accurate and robust empirical scaling law to predict sparse
model performance given adequate training resources. Finally, we demonstrate
the practical applicability of our sparse models by evaluating them under
quantization and fine-tuning scenarios.
comment: Submitted to IEEE TPAMI
☆ Reliability Crisis of Reference-free Metrics for Grammatical Error Correction EMNLP 2025
Reference-free evaluation metrics for grammatical error correction (GEC) have
achieved high correlation with human judgments. However, these metrics are not
designed to evaluate adversarial systems that aim to obtain unjustifiably high
scores. The existence of such systems undermines the reliability of automatic
evaluation, as it can mislead users in selecting appropriate GEC systems. In
this study, we propose adversarial attack strategies for four reference-free
metrics: SOME, Scribendi, IMPARA, and LLM-based metrics, and demonstrate that
our adversarial systems outperform the current state-of-the-art. These findings
highlight the need for more robust evaluation methods.
comment: EMNLP 2025 Findings
☆ RoRecomp: Enhancing Reasoning Efficiency via Rollout Response Recomposition in Reinforcement Learning
Reinforcement learning with verifiable rewards (RLVR) has proven effective in
eliciting complex reasoning in large language models (LLMs). However, standard
RLVR training often leads to excessively verbose processes (in reasoning tasks)
and inefficient exploration trajectories (in agentic settings), as outcome-only
rewards provide no incentive for efficiency and the high variance in response
length within relatively small rollout groups results in noisy optimization
signals. To address this, we propose Rollout Response Recomposition (RoRecomp),
a plug-and-play method that guides models toward concise reasoning by
strategically recomposing the training data. RoRecomp separates responses into
two distinct batch types: 1) priority batches, which combine short-correct and
long-incorrect responses selected from online batches to provide a clear
gradient signal for brevity, and 2) compensation batches, which utilize
remaining responses from a replay buffer to maintain stability and prevent
model collapse. To comprehensively evaluate effectiveness, we test RoRecomp
across three settings where results demonstrate substantial efficiency gains:
reducing reasoning length by 27.7% in zero RL training, reducing unnecessary
tool calls by 46.8% while improving accuracy in agentic RL, and achieving up to
52.5% length reduction in thinking compression, all with minimal performance
impact.
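A schematic of the recomposition step: split online rollouts into a priority batch (short-correct plus long-incorrect responses) and fill the rest of the batch from a replay buffer. The median-length threshold and the buffer policy below are illustrative choices, not the paper's exact rule.

    from dataclasses import dataclass, field
    from typing import List
    import random

    @dataclass
    class Rollout:
        prompt_id: int
        tokens: int          # response length
        correct: bool        # verifiable-reward outcome

    @dataclass
    class ReplayBuffer:
        items: List[Rollout] = field(default_factory=list)
        def add(self, rollouts): self.items.extend(rollouts)
        def sample(self, n): return random.sample(self.items, min(n, len(self.items)))

    def recompose(online: List[Rollout], buffer: ReplayBuffer, batch_size: int):
        """Build (1) a priority batch of short-correct and long-incorrect responses,
        giving a clean gradient signal for brevity, and (2) a compensation batch
        drawn from remaining/replayed responses for stability."""
        lengths = sorted(r.tokens for r in online)
        median = lengths[len(lengths) // 2]
        priority = [r for r in online if (r.correct and r.tokens <= median)
                                       or (not r.correct and r.tokens > median)]
        leftovers = [r for r in online if r not in priority]
        buffer.add(leftovers)
        compensation = buffer.sample(max(batch_size - len(priority), 0))
        return priority[:batch_size], compensation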
☆ Boosting Process-Correct CoT Reasoning by Modeling Solvability of Multiple-Choice QA
Reasoning quality in large language models depends not only on producing
correct answers but also on generating valid intermediate steps. We study this
through multiple-choice question answering (MCQA), which provides a controlled
setting with fixed answer options. Our analysis shows that when questions are
effectively unsolvable for a model, spurious chains of thought (CoTs) are more
likely to appear, leading to false positives. By estimating the solvability of
each question, we uncover an intermediate regime where learning is most
effective. Building on this insight, we adapt outcome-supervised reward models
and reinforcement learning with group-relative advantage to incorporate
solvability into their objectives. Across experiments on math and multimodal
datasets, these modifications consistently yield higher rates of
process-correct reasoning and, in reinforcement learning, improved answer
accuracy as well. Our results highlight solvability as a key factor for
reducing hallucinations and increasing reliability in CoT reasoning.
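A minimal sketch of how solvability could enter the objective: estimate it as the fraction of sampled answers that hit the gold option, then scale group-relative advantages by a weight that emphasizes the intermediate regime. The thresholds and weighting function are assumptions for illustration, not the paper's formulation.

    def estimate_solvability(sampled_answers, gold):
        """Solvability proxy: fraction of sampled answers matching the gold option."""
        return sum(a == gold for a in sampled_answers) / max(len(sampled_answers), 1)

    def solvability_weight(p, low=0.2, high=0.8):
        """Emphasize the intermediate regime where learning is most effective
        (thresholds are illustrative assumptions)."""
        if p <= 0.0 or p >= 1.0:
            return 0.0                      # effectively unsolvable or trivially solved
        return 1.0 if low <= p <= high else 0.5

    def weighted_advantages(rewards, p):
        """Scale group-relative advantages by the solvability weight."""
        mean = sum(rewards) / len(rewards)
        w = solvability_weight(p)
        return [w * (r - mean) for r in rewards]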
☆ DeepJSONEval: Benchmarking Complex Nested JSON Data Mining for Large Language Models
The internet is saturated with low-density, high-redundancy information, such
as social media comments, repetitive news, and lengthy discussions, making it
difficult to extract valuable insights efficiently. Multi-layer nested JSON
structures provide an effective solution by compressing such information into
semantically rich, hierarchical representations, which organize data into
key-value pairs, arrays, and nested objects, preserving contextual
relationships and enabling efficient storage, retrieval, and semantic querying.
For instance, in news aggregation, a JSON object can nest an article's metadata
(title, author, date), content (text, multimedia), and multimedia information
(multimedia type, caption) hierarchically. Large Language Models (LLMs) play a
transformative role in web data mining by parsing unstructured text and
outputting structured results directly into complex JSON schemas. However,
current benchmarks for evaluating LLMs' JSON output capabilities overemphasize
pure JSON generation rather than assessing data comprehension and extraction
abilities, a limitation that lacks relevance to practical web data mining
tasks. To address this, we introduce DeepJSONEval, a novel benchmark featuring
2100 multi-domain instances with deep nested structures, categorized by
difficulty. Experiments show significant performance gaps among LLMs in
handling such complexity. Our benchmark and datasets are open-sourced to
advance research in structured JSON generation
(https://github.com/GTS-AI-Infra-Lab-SotaS/DeepJSONEval).
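The news-aggregation example from the abstract, written out as a concrete nested record (field names are illustrative):

    import json

    article = {
        "metadata": {"title": "Sample headline", "author": "A. Writer", "date": "2025-09-30"},
        "content": {
            "text": "Body of the article...",
            "multimedia": [
                {"multimedia_type": "image", "caption": "Scene photo"},
                {"multimedia_type": "video", "caption": "Interview clip"},
            ],
        },
    }
    # The kind of deeply nested target schema the benchmark asks models to fill.
    print(json.dumps(article, indent=2))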
☆ Bringing Emerging Architectures to Sequence Labeling in NLP
Pretrained Transformer encoders are the dominant approach to sequence
labeling. While some alternative architectures, such as xLSTMs, structured
state-space models, diffusion models, and adversarial learning, have shown
promise in language modeling, few have been applied to sequence labeling, and
mostly on flat or simplified tasks. We study how these architectures adapt
across tagging tasks that vary in structural complexity, label space, and token
dependencies, with evaluation spanning multiple languages. We find that the
strong performance previously observed in simpler settings does not always
generalize well across languages or datasets, nor does it extend to more
complex structured tasks.
☆ VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs
Vision-Language Models (VLMs) excel at high-level scene understanding but
falter on fine-grained perception tasks requiring precise localization. This
failure stems from a fundamental mismatch, as generating exact numerical
coordinates is a challenging task for language-centric architectures. In this
paper, we introduce VLM-FO1, a novel framework that overcomes this limitation
by reframing object-centric perception from a brittle coordinate generation
problem into a robust feature retrieval task. Our method operates as a
plug-and-play module that integrates with any pre-trained VLM. It leverages a
Hybrid Fine-grained Region Encoder (HFRE), featuring a dual vision encoder, to
generate powerful region tokens rich in both semantic and spatial detail. A
token-based referencing system then enables the LLM to seamlessly reason about
and ground language in these specific visual regions. Experiments show that
VLM-FO1 achieves state-of-the-art performance across a diverse suite of
benchmarks, demonstrating exceptional capabilities in object grounding, region
generative understanding, and visual region reasoning. Crucially, our
two-stage training strategy ensures that these perception gains are achieved
without compromising the base model's general visual understanding
capabilities. VLM-FO1 establishes an effective and flexible paradigm for
building perception-aware VLMs, bridging the gap between high-level reasoning
and fine-grained visual grounding.
comment: 22 pages
☆ Understanding the Mixture-of-Experts with Nadaraya-Watson Kernel
Chuanyang Zheng, Jiankai Sun, Yihang Gao, Enze Xie, Yuehao Wang, Peihao Wang, Ting Xu, Matthew Chang, Liliang Ren, Jingyao Li, Jing Xiong, Kashif Rasul, Mac Schwager, Anderson Schneider, Zhangyang Wang, Yuriy Nevmyvaka
Mixture-of-Experts (MoE) has become a cornerstone in recent state-of-the-art
large language models (LLMs). Traditionally, MoE relies on $\mathrm{Softmax}$
as the router score function to aggregate expert output, a design choice that
has persisted from the earliest MoE models to modern LLMs, and is now widely
regarded as standard practice. However, the necessity of using
$\mathrm{Softmax}$ to project router weights into a probability simplex remains
an unchallenged assumption rather than a principled design choice. In this
work, we first revisit the classical Nadaraya-Watson regression and observe
that MoE shares the same mathematical formulation as Nadaraya-Watson
regression. Furthermore, we show that both feed-forward neural network (FFN)
and MoE can be interpreted as a special case of Nadaraya-Watson regression,
where the kernel function corresponds to the input neurons of the output layer.
Motivated by these insights, we propose the \textbf{zero-additional-cost}
Kernel Inspired Router with Normalization (KERN), an FFN-style router function,
as an alternative to $\mathrm{Softmax}$. We demonstrate that this router
generalizes both $\mathrm{Sigmoid}$- and $\mathrm{Softmax}$-based routers.
\textbf{Based on empirical observations and established practices in FFN
implementation, we recommend the use of $\mathrm{ReLU}$ activation and
$\ell_2$-normalization in $\mathrm{KERN}$ router function.} Comprehensive
experiments on MoE models and LLMs validate the effectiveness of the proposed
FFN-style router function KERN.
comment: Tech Report
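A sketch of an FFN-style router in the spirit of KERN, replacing the Softmax with ReLU scores followed by l2 normalization. Layer shapes and the exact normalization are assumptions based on the abstract, not the released implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class KernRouter(nn.Module):
        """FFN-style router sketch: non-negative expert affinities via ReLU,
        then l2 normalization instead of a Softmax over experts."""
        def __init__(self, d_model: int, n_experts: int):
            super().__init__()
            self.w = nn.Linear(d_model, n_experts, bias=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            scores = F.relu(self.w(x))                          # non-negative expert affinities
            return F.normalize(scores, p=2, dim=-1, eps=1e-6)   # l2-normalized routing weights

    # Usage sketch: weights = KernRouter(1024, 8)(hidden_states)
    #               out = (weights.unsqueeze(-1) * expert_outs).sum(-2)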
☆ Mem-α: Learning Memory Construction via Reinforcement Learning
Large language model (LLM) agents are constrained by limited context windows,
necessitating external memory systems for long-term information understanding.
Current memory-augmented agents typically depend on pre-defined instructions
and tools for memory updates. However, language models may lack the ability to
determine which information to store, how to structure it, and when to update
it, especially as memory systems become more complex. This results in
suboptimal memory construction and information loss. To address this, we propose
Mem-alpha, a reinforcement learning framework that trains agents to effectively
manage complex memory systems through interaction and feedback. We also
construct a specialized training dataset spanning diverse multi-turn
interaction patterns paired with comprehensive evaluation questions designed to
teach effective memory management. During training, agents process sequential
information chunks, learn to extract and store relevant content, then update
the memory system. The reward signal derives from downstream question-answering
accuracy over the full interaction history, directly optimizing for memory
construction. To illustrate the effectiveness of our training framework, we
design a memory architecture comprising core, episodic, and semantic
components, equipped with multiple tools for memory operations. Empirical
evaluation demonstrates that Mem-alpha achieves significant improvements over
existing memory-augmented agent baselines. Despite being trained exclusively on
instances with a maximum length of 30k tokens, our agents exhibit remarkable
generalization to sequences exceeding 400k tokens, over 13x the training
length, highlighting the robustness of Mem-alpha.
☆ PerQ: Efficient Evaluation of Multilingual Text Personalization Quality
Since no metrics are available to evaluate specific aspects of a text, such
as its personalization quality, researchers often rely solely on large language
models to meta-evaluate such texts. Due to the internal biases of individual
language models, it is recommended to combine several of them for evaluation,
which directly increases the cost of such meta-evaluation. In this paper, we
introduce PerQ, a computationally efficient method for evaluating the
personalization quality of a text generated by a language model. A case study
comparing the generation capabilities of large and small language models shows
the usability of the proposed metric in research, effectively reducing wasted
resources.
☆ RoleConflictBench: A Benchmark of Role Conflict Scenarios for Evaluating LLMs' Contextual Sensitivity
Humans often encounter role conflicts -- social dilemmas where the
expectations of multiple roles clash and cannot be simultaneously fulfilled. As
large language models (LLMs) become increasingly influential in human
decision-making, understanding how they behave in complex social situations is
essential. While previous research has evaluated LLMs' social abilities in
contexts with predefined correct answers, role conflicts represent inherently
ambiguous social dilemmas that require contextual sensitivity: the ability to
recognize and appropriately weigh situational cues that can fundamentally alter
decision priorities. To address this gap, we introduce RoleConflictBench, a
novel benchmark designed to evaluate LLMs' contextual sensitivity in complex
social dilemmas. Our benchmark employs a three-stage pipeline to generate over
13K realistic role conflict scenarios across 65 roles, systematically varying
their associated expectations (i.e., their responsibilities and obligations)
and situational urgency levels. By analyzing model choices across 10 different
LLMs, we find that while LLMs show some capacity to respond to these contextual
cues, this sensitivity is insufficient. Instead, their decisions are
predominantly governed by a powerful, inherent bias related to social roles
rather than situational information. Our analysis quantifies these biases,
revealing a dominant preference for roles within the Family and Occupation
domains, as well as a clear prioritization of male roles and Abrahamic
religions across most evaluated models.
☆ A Multimodal LLM Approach for Visual Question Answering on Multiparametric 3D Brain MRI
Arvind Murari Vepa, Yannan Yu, Jingru Gan, Anthony Cuturrufo, Weikai Li, Wei Wang, Fabien Scalzo, Yizhou Sun
We introduce mpLLM, a prompt-conditioned hierarchical mixture-of-experts
(MoE) architecture for visual question answering over multi-parametric 3D brain
MRI (mpMRI). mpLLM routes across modality-level and token-level projection
experts to fuse multiple interrelated 3D modalities, enabling efficient
training without image--report pretraining. To address limited image-text
paired supervision, mpLLM integrates a synthetic visual question answering
(VQA) protocol that generates medically relevant VQA from segmentation
annotations, and we collaborate with medical experts for clinical validation.
mpLLM outperforms strong medical VLM baselines by 5.3% on average across
multiple mpMRI datasets. Our study features three main contributions: (1) the
first clinically validated VQA dataset for 3D brain mpMRI, (2) a novel
multimodal LLM that handles multiple interrelated 3D modalities, and (3) strong
empirical results that demonstrate the medical utility of our methodology.
Ablations highlight the importance of modality-level and token-level experts
and prompt-conditioned routing. We have included our source code in the
supplementary materials and will release our dataset upon publication.
comment: 23 pages, 3 figures
☆ ASR Under Noise: Exploring Robustness for Sundanese and Javanese
We investigate the robustness of Whisper-based automatic speech recognition
(ASR) models for two major Indonesian regional languages: Javanese and
Sundanese. While recent work has demonstrated strong ASR performance under
clean conditions, their effectiveness in noisy environments remains unclear. To
address this, we experiment with multiple training strategies, including
synthetic noise augmentation and SpecAugment, and evaluate performance across a
range of signal-to-noise ratios (SNRs). Our results show that noise-aware
training substantially improves robustness, particularly for larger Whisper
models. A detailed error analysis further reveals language-specific challenges,
highlighting avenues for future improvements.
☆ Lita: Light Agent Uncovers the Agentic Coding Capabilities of LLMs
Hankun Dai, Maoquan Wang, Mengnan Qi, Yikai Zhang, Zijian Jin, Yongqiang Yao, Yufan Huang, Shengyu Fu, Elsie Nallipogu
Large language models (LLMs) are increasingly being applied to programming
tasks, ranging from single-turn code completion to autonomous agents. Current
code agent designs frequently depend on complex, hand-crafted workflows and
tool sets. However, this reliance on elaborate scaffolding presents several
challenges: agent performance becomes overly dependent on prompt tuning and
custom design choices, heavy human intervention obscures a model's true
underlying capabilities, and intricate pipelines are costly to build and
maintain. Furthermore, optimizing complex task prompts increases the risk of
data leakage. Currently, when introducing new models, LLM providers like OpenAI
and Anthropic often publish benchmark scores to demonstrate their models'
coding proficiency, but keep their proprietary evaluation frameworks
confidential. To address these limitations, we introduce Lita (Lite Agent),
which operationalizes liteness, a principle of minimizing manual design while
retaining the essential elements of a fully autonomous agent. Lita enables a
more faithful and unified evaluation without elaborate scaffolding. Experiments
on the Aider Polyglot and SWE-Bench with frontier models demonstrate that Lita
achieves competitive or superior performance compared to workflow-based and
agentic baselines. Crucially, Lita also consumes fewer tokens and requires
significantly less design effort. Our results suggest that Lita is sufficient
to reveal the underlying coding competence of modern LLMs. Finally, we propose
the Agent Complexity Law: the performance gap between agents of varying
complexity, from simple to sophisticated designs, will shrink as the core model
improves, ultimately converging to a negligible difference.
☆ ReFACT: A Benchmark for Scientific Confabulation Detection with Positional Error Annotations
Yindong Wang, Martin Preiß, Margarita Bugueño, Jan Vincent Hoffbauer, Abdullatif Ghajar, Tolga Buz, Gerard de Melo
Large Language Models (LLMs) frequently confabulate scientific facts, severely
undermining their trustworthiness. Addressing this challenge requires
benchmarks that go beyond binary factuality and enable fine-grained evaluation.
We introduce \textbf{ReFACT} (\textit{Reddit False And Correct Texts}), a
benchmark of 1,001 expert-annotated question--answer pairs spanning diverse
scientific domains for the detection of scientific confabulation. Each instance
includes both a scientifically correct answer and a non-factual counterpart
annotated with \textbf{precise error spans and error-types}. ReFACT enables
multi-stage evaluation: (1) confabulation detection, (2) fine-grained error
localization, and (3) correction. We benchmark 9 state-of-the-art LLMs,
revealing limited performance ($\sim$50\% accuracy). Even top models such as
GPT-4o fail to distinguish factual from confabulated scientific answers,
raising concerns about the reliability of \textit{LLM-as-judge} evaluation
paradigms. Our findings highlight the need for fine-grained, human-validated
benchmarks to detect and correct scientific confabulation in domain-specific
contexts. The dataset is released on GitHub at
https://github.com/ddz5431/ReFACT.
☆ Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation
Large Language Models (LLMs) can self-improve through reinforcement learning,
where they generate trajectories to explore and discover better solutions.
However, this exploration process is computationally expensive, often forcing
current methods to assign limited exploration budgets to each task. This
uniform allocation creates problematic edge cases: easy tasks consistently
succeed while difficult tasks consistently fail, both producing zero gradients
during training updates for the widely used Group Relative Policy Optimization
(GRPO). We address this problem from the lens of exploration budget allocation.
Viewing each task's exploration as an "item" with a distinct "value" and
"cost", we establish a connection to the classical knapsack problem. This
formulation allows us to derive an optimal assignment rule that adaptively
distributes resources based on the model's current learning status. When
applied to GRPO, our method increases the effective ratio of non-zero policy
gradients by 20-40% during training. Acting as a computational "free lunch",
our approach could reallocate exploration budgets from tasks where learning is
saturated to those where it is most impactful. This enables significantly
larger budgets (e.g., 93 rollouts) for especially challenging problems, which
would be computationally prohibitive under a uniform allocation. These
improvements translate to meaningful gains on mathematical reasoning
benchmarks, with average improvements of 2-4 points and peak gains of 9 points
on specific tasks. Notably, achieving comparable performance with traditional
homogeneous allocation would require about 2x the computational resources.
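To make the knapsack framing concrete, the sketch below treats each extra rollout's value as the marginal increase in the chance that a task's rollout group has mixed outcomes (hence non-zero GRPO advantage), and spends a shared rollout budget greedily on the largest marginal gains. It is a simplified stand-in for the paper's optimal assignment rule, with uniform per-rollout cost assumed.

    import heapq

    def marginal_gain(p: float, n: int) -> float:
        """Gain in the probability of a mixed-outcome group (non-zero advantage)
        when going from n to n+1 rollouts, for per-rollout success probability p."""
        def mixed(k): return 1.0 - p ** k - (1.0 - p) ** k if k > 0 else 0.0
        return mixed(n + 1) - mixed(n)

    def allocate_budget(success_probs, total_rollouts, min_per_task=1):
        """Greedy knapsack-style allocation: repeatedly give the next rollout to the
        task with the largest marginal value (per-rollout cost assumed uniform)."""
        n = [min_per_task] * len(success_probs)
        heap = [(-marginal_gain(p, min_per_task), i) for i, p in enumerate(success_probs)]
        heapq.heapify(heap)
        remaining = total_rollouts - min_per_task * len(success_probs)
        while remaining > 0 and heap:
            gain, i = heapq.heappop(heap)
            n[i] += 1
            remaining -= 1
            heapq.heappush(heap, (-marginal_gain(success_probs[i], n[i]), i))
        return n

    # Saturated tasks (p near 0 or 1) receive few rollouts; hard-but-learnable tasks get many.
    print(allocate_budget([0.02, 0.5, 0.98, 0.15], total_rollouts=64))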
☆ Believing without Seeing: Quality Scores for Contextualizing Vision-Language Model Explanations
When people query Vision-Language Models (VLMs) but cannot see the
accompanying visual context (e.g. for blind and low-vision users), augmenting
VLM predictions with natural language explanations can signal which model
predictions are reliable. However, prior work has found that explanations can
easily convince users that inaccurate VLM predictions are correct. To remedy
undesirable overreliance on VLM predictions, we propose evaluating two
complementary qualities of VLM-generated explanations via two quality scoring
functions. We introduce Visual Fidelity, which captures how faithful an
explanation is to the visual context, and Contrastiveness, which captures how
well the explanation identifies visual details that distinguish the model's
prediction from plausible alternatives. On the A-OKVQA and VizWiz tasks, these
quality scoring functions are better calibrated with model correctness than
existing explanation qualities. We conduct a user study in which participants
have to decide whether a VLM prediction is accurate without viewing its visual
context. We observe that showing our quality scores alongside VLM explanations
improves participants' accuracy at predicting VLM correctness by 11.1%,
including a 15.4% reduction in the rate of falsely believing incorrect
predictions. These findings highlight the utility of explanation quality scores
in fostering appropriate reliance on VLM predictions.
☆ Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling
While large reasoning models trained with critic-free reinforcement learning
and verifiable rewards (RLVR) represent the state-of-the-art, their practical
utility is hampered by ``overthinking'', a critical issue where models generate
excessively long reasoning paths without any performance benefit. Existing
solutions that penalize length often fail, inducing performance degradation due
to a fundamental misalignment between trajectory-level rewards and token-level
optimization. In this work, we introduce a novel framework, DECS, built on our
theoretical discovery of two previously unaddressed flaws in current length
rewards: (1) the erroneous penalization of essential exploratory tokens and (2)
the inadvertent rewarding of partial redundancy. Our framework's innovations
include (i) a first-of-its-kind decoupled token-level reward mechanism that
surgically distinguishes and penalizes redundant tokens, and (ii) a novel
curriculum batch scheduling strategy to master the efficiency-efficacy
equilibrium. Experimental results show that DECS reduces reasoning tokens by
over 50\% across seven benchmarks while maintaining or even improving
performance, demonstrating that
substantial gains in reasoning efficiency can be achieved without compromising
a model's underlying reasoning power.
comment: 26 pages
☆ VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions EMNLP 2025
In this study, we focus on the automatic evaluation of long and detailed
image captions generated by multimodal Large Language Models (MLLMs). Most
existing automatic evaluation metrics for image captioning are primarily
designed for short captions and are not suitable for evaluating long captions.
Moreover, recent LLM-as-a-Judge approaches suffer from slow inference due to
their reliance on autoregressive decoding and early fusion of visual
information. To address these limitations, we propose VELA, an automatic
evaluation metric for long captions developed within a novel
LLM-Hybrid-as-a-Judge framework. Furthermore, we propose LongCap-Arena, a
benchmark specifically designed for evaluating metrics for long captions. This
benchmark comprises 7,805 images, the corresponding human-provided long
reference captions and long candidate captions, and 32,246 human judgments from
three distinct perspectives: Descriptiveness, Relevance, and Fluency. We
demonstrate that VELA outperforms existing metrics and achieves superhuman
performance on LongCap-Arena.
comment: EMNLP 2025 Main Conference
☆ Personalized Scientific Figure Caption Generation: An Empirical Study on Author-Specific Writing Style Transfer
We study personalized figure caption generation using author profile data
from scientific papers. Our experiments demonstrate that rich author profile
data, combined with relevant metadata, can significantly improve the
personalization performance of multimodal large language models. However, we
also reveal a fundamental trade-off between matching author style and
maintaining caption quality. Our findings offer valuable insights and future
directions for developing practical caption automation systems that balance
both objectives. This work was conducted as part of the 3rd SciCap challenge.
☆ ReTAG: Retrieval-Enhanced, Topic-Augmented Graph-Based Global Sensemaking EMNLP 2025
Recent advances in question answering have led to substantial progress in
tasks such as multi-hop reasoning. However, global sensemaking, i.e., answering
questions by synthesizing information from an entire corpus, remains a
significant challenge. A prior graph-based approach to global sensemaking lacks
retrieval mechanisms and topic specificity, and incurs high inference costs. To
address these limitations, we propose ReTAG, a Retrieval-Enhanced,
Topic-Augmented Graph framework that constructs topic-specific subgraphs and
retrieves the relevant summaries for response generation. Experiments show that
ReTAG improves response quality while significantly reducing inference time
compared to the baseline. Our code is available at
https://github.com/bykimby/retag.
comment: 9 pages, 5 figures, EMNLP 2025 Findings
☆ RoBiologyDataChoiceQA: A Romanian Dataset for improving Biology understanding of Large Language Models
In recent years, large language models (LLMs) have demonstrated significant
potential across various natural language processing (NLP) tasks. However,
their performance in domain-specific applications and non-English languages
remains less explored. This study introduces a novel Romanian-language dataset
for multiple-choice biology questions, carefully curated to assess LLM
comprehension and reasoning capabilities in scientific contexts. Containing
approximately 14,000 questions, the dataset provides a comprehensive resource
for evaluating and improving LLM performance in biology.
We benchmark several popular LLMs, analyzing their accuracy, reasoning
patterns, and ability to understand domain-specific terminology and linguistic
nuances. Additionally, we perform comprehensive experiments to evaluate the
impact of prompt engineering, fine-tuning, and other optimization techniques on
model performance. Our findings highlight both the strengths and limitations of
current LLMs in handling specialized knowledge tasks in low-resource languages,
offering valuable insights for future research and development.
☆ Learning to Reason as Action Abstractions with Scalable Mid-Training RL
Large language models excel with reinforcement learning (RL), but fully
unlocking this potential requires a mid-training stage. An effective
mid-training phase should identify a compact set of useful actions and enable
fast selection among them through online RL. We formalize this intuition by
presenting the first theoretical result on how mid-training shapes
post-training: it characterizes an action subspace that minimizes both the
value approximation error from pruning and the RL error during subsequent
planning. Our analysis reveals two key determinants of mid-training
effectiveness: pruning efficiency, which shapes the prior of the initial RL
policy, and its impact on RL convergence, which governs the extent to which
that policy can be improved via online interactions. These results suggest that
mid-training is most effective when the decision space is compact and the
effective horizon is short, highlighting the importance of operating in the
space of action abstractions rather than primitive actions. Building on these
insights, we propose Reasoning as Action Abstractions (RA3), a scalable
mid-training algorithm. Specifically, we derive a sequential variational lower
bound and optimize it by iteratively discovering temporally-consistent latent
structures via RL, followed by fine-tuning on the bootstrapped data.
Experiments on code generation tasks demonstrate the effectiveness of our
approach. Across multiple base models, RA3 improves the average performance on
HumanEval and MBPP by 8 and 4 points over the base model and the next-token
prediction baseline. Furthermore, RA3 achieves faster convergence and higher
asymptotic performance in RLVR on HumanEval+, MBPP+, LiveCodeBench, and
Codeforces.
☆ Assessing Algorithmic Bias in Language-Based Depression Detection: A Comparison of DNN and LLM Approaches
This paper investigates algorithmic bias in language-based models for
automated depression detection, focusing on socio-demographic disparities
related to gender and race/ethnicity. Models trained using deep neural networks
(DNN) based embeddings are compared to few-shot learning approaches with large
language models (LLMs), evaluating both performance and fairness on clinical
interview transcripts from the Distress Analysis Interview Corpus/Wizard-of-Oz
(DAIC-WOZ). To mitigate bias, fairness-aware loss functions are applied to
DNN-based models, while in-context learning with varied prompt framing and shot
counts is explored for LLMs. Results indicate that LLMs outperform DNN-based
models in depression classification, particularly for underrepresented groups
such as Hispanic participants. LLMs also exhibit reduced gender bias compared
to DNN-based embeddings, though racial disparities persist. Among
fairness-aware techniques for mitigating bias in DNN-based embeddings, the
worst-group loss, which is designed to minimize loss for the worst-performing
demographic group, achieves a better balance between performance and fairness.
In contrast, the fairness-regularized loss minimizes loss across all groups but
performs less effectively. In LLMs, guided prompting with ethical framing helps
mitigate gender bias in the 1-shot setting. However, increasing the number of
shots does not lead to further reductions in disparities. For race/ethnicity,
neither prompting strategy nor increasing $N$ in $N$-shot learning effectively
reduces disparities.
comment: 7 pages, 1 figure. This paper has been accepted to the IEEE-EMBS
International Conference on Biomedical and Health Informatics (BHI 2025),
Georgia Institute of Technology, Atlanta, Georgia, October 26-29, 2025
♻ ☆ The Impact of Language Mixing on Bilingual LLM Reasoning EMNLP 2025
Proficient multilingual speakers often intentionally switch languages in the
middle of a conversation. Similarly, recent reasoning-focused bilingual large
language models (LLMs) with strong capabilities in both languages exhibit
language mixing, i.e., alternating languages within their chain of thought.
Discouraging this behavior in DeepSeek-R1 was found to degrade accuracy,
suggesting that language mixing may benefit reasoning. In this work, we study
language switching in Chinese-English bilingual reasoning models. We identify
reinforcement learning with verifiable rewards (RLVR) as the critical training
stage that leads to language mixing. We show that language mixing can enhance
reasoning: enforcing monolingual decoding reduces accuracy by 5.6 percentage
points on MATH500. Additionally, a lightweight probe can be trained to predict
whether a potential language switch would benefit or harm reasoning, and when
used to guide decoding, increases accuracy by 2.92 percentage points. Our
findings suggest that language mixing is not merely a byproduct of multilingual
training, but is a strategic reasoning behavior.
comment: Accepted at EMNLP 2025 (Main Conference)
♻ ☆ Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models
The impact of misinformation arises not only from factual inaccuracies but
also from the misleading narratives that creators deliberately embed.
Interpreting such creator intent is therefore essential for multimodal
misinformation detection (MMD) and effective information governance. To this
end, we introduce DeceptionDecoded, a large-scale benchmark of 12,000
image-caption pairs grounded in trustworthy reference articles, created using
an intent-guided simulation framework that models both the desired influence
and the execution plan of news creators. The dataset captures both misleading
and non-misleading cases, spanning manipulations across visual and textual
modalities, and supports three intent-centric tasks: (1) misleading intent
detection, (2) misleading source attribution, and (3) creator desire inference.
We evaluate 14 state-of-the-art vision-language models (VLMs) and find that
they struggle with intent reasoning, often relying on shallow cues such as
surface-level alignment, stylistic polish, or heuristic authenticity signals.
These results highlight the limitations of current VLMs and position
DeceptionDecoded as a foundation for developing intent-aware models that go
beyond shallow cues in MMD.
♻ ☆ MaskSQL: Safeguarding Privacy for LLM-Based Text-to-SQL via Abstraction NeurIPS 2025
Sepideh Abedini, Shubhankar Mohapatra, D. B. Emerson, Masoumeh Shafieinejad, Jesse C. Cresswell, Xi He
Large language models (LLMs) have shown promising performance on tasks that
require reasoning, such as text-to-SQL, code generation, and debugging.
However, regulatory frameworks with strict privacy requirements constrain their
integration into sensitive systems. State-of-the-art LLMs are also proprietary,
costly, and resource-intensive, making local deployment impractical.
Consequently, utilizing such LLMs often requires sharing data with third-party
providers, raising privacy concerns and risking noncompliance with regulations.
Although fine-tuned small language models (SLMs) can outperform LLMs on certain
tasks and be deployed locally to mitigate privacy concerns, they underperform
on more complex tasks such as text-to-SQL translation. In this work, we
introduce MaskSQL, a text-to-SQL framework that utilizes abstraction as a
privacy protection mechanism to mask sensitive information in LLM prompts.
Unlike redaction, which removes content entirely, or generalization, which
broadens tokens, abstraction retains essential information while discarding
unnecessary details, striking an effective privacy-utility balance for the
text-to-SQL task. Moreover, by providing mechanisms to control the
privacy-utility tradeoff, MaskSQL facilitates adoption across a broader range
of use cases. Our experimental results show that MaskSQL outperforms leading
SLM-based text-to-SQL models and achieves performance approaching
state-of-the-art LLM-based models, while preserving privacy.
comment: Accepted to the 3rd Workshop on Regulatable ML at NeurIPS 2025
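The abstraction step can be pictured with a small masking helper: sensitive
terms in the natural-language question are replaced by category placeholders
before the prompt leaves the trusted environment, and the mapping is kept to
restore the generated SQL. The placeholder scheme and regex substitution are
simplifying assumptions, not MaskSQL's actual pipeline.

```python
import re

def abstract_prompt(question, sensitive_terms):
    """Mask sensitive terms in a text-to-SQL prompt via abstraction.

    sensitive_terms: dict mapping a real term to an abstract category label,
    e.g. {"acme_corp": "COMPANY"}.  Returns the masked question and the
    mapping needed to restore the SQL produced by the remote LLM.
    """
    mapping, masked = {}, question
    for i, (term, category) in enumerate(sensitive_terms.items()):
        placeholder = f"<{category}_{i}>"
        masked = re.sub(re.escape(term), placeholder, masked, flags=re.IGNORECASE)
        mapping[placeholder] = term
    return masked, mapping

def restore_sql(sql, mapping):
    """Substitute the original terms back into the SQL returned by the LLM."""
    for placeholder, term in mapping.items():
        sql = sql.replace(placeholder, term)
    return sql

masked_q, m = abstract_prompt(
    "List orders placed by acme_corp after 2023", {"acme_corp": "COMPANY"})
print(masked_q)   # List orders placed by <COMPANY_0> after 2023
```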
♻ ☆ VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use
Dongfu Jiang, Yi Lu, Zhuofeng Li, Zhiheng Lyu, Ping Nie, Haozhe Wang, Alex Su, Hui Chen, Kai Zou, Chao Du, Tianyu Pang, Wenhu Chen
Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated
success in enhancing LLM reasoning capabilities, but remains limited to
single-turn interactions without tool integration. While recent Agentic
Reinforcement Learning with Tool use (ARLT) approaches have emerged to address
multi-turn tool interactions, existing works develop task-specific codebases
that suffer from fragmentation, synchronous execution bottlenecks, and limited
extensibility across domains. These inefficiencies hinder broader community
adoption and algorithmic innovation. We introduce VerlTool, a unified and
modular framework that addresses these limitations through systematic design
principles. VerlTool provides four key contributions: (1) upstream alignment
with VeRL ensuring compatibility and simplified maintenance, (2) unified tool
management via standardized APIs supporting diverse modalities including code
execution, search, SQL databases, and vision processing, (3) asynchronous
rollout execution achieving near 2$\times$ speedup by eliminating
synchronization bottlenecks, and (4) comprehensive evaluation demonstrating
competitive performance across 6 ARLT domains. Our framework formalizes ARLT as
multi-turn trajectories with multi-modal observation tokens (text/image/video),
extending beyond single-turn RLVR paradigms. We train and evaluate models on
mathematical reasoning, knowledge QA, SQL generation, visual reasoning, web
search, and software engineering tasks, achieving results comparable to
specialized systems while providing unified training infrastructure. The
modular plugin architecture enables rapid tool integration requiring only
lightweight Python definitions, significantly reducing development overhead and
providing a scalable foundation for tool-augmented RL research. Our code is
open-sourced at https://github.com/TIGER-AI-Lab/verl-tool.
comment: 32 pages, 5 figures, 13 tables
♻ ☆ AutoJudge: Judge Decoding Without Manual Annotation
Roman Garipov, Fedor Velikonivtsev, Ivan Ermakov, Ruslan Svirschevski, Vage Egiazarian, Max Ryabinin
We introduce AutoJudge, a method that accelerates large language model (LLM)
inference with task-specific lossy speculative decoding. Instead of matching
the original model output distribution token-by-token, we identify which of the
generated tokens affect the downstream quality of the response, relaxing the
distribution match guarantee so that the "unimportant" tokens can be generated
faster. Our approach relies on a semi-greedy search algorithm to test which of
the mismatches between target and draft models should be corrected to preserve
quality and which ones may be skipped. We then train a lightweight classifier
based on existing LLM embeddings to predict, at inference time, which
mismatching tokens can be safely accepted without compromising the final answer
quality. We evaluate the effectiveness of AutoJudge with multiple draft/target
model pairs on mathematical reasoning and programming benchmarks, achieving
significant speedups at the cost of a minor accuracy reduction. Notably, on
GSM8k with the Llama 3.1 70B target model, our approach achieves up to
$\approx2\times$ speedup over speculative decoding at the cost of $\le 1\%$
drop in accuracy. When applied to the LiveCodeBench benchmark, AutoJudge
automatically detects programming-specific important tokens, accepting $\ge 25$
tokens per speculation cycle at $2\%$ drop in Pass@1. Our approach requires no
human annotation and is easy to integrate with modern LLM inference frameworks.
comment: Preprint
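The acceptance step can be sketched as follows: at each position where the
draft and target models disagree, a lightweight classifier over per-position
features decides whether the mismatch is "unimportant" enough to accept. The
classifier interface, feature source, and threshold are assumptions for
illustration, not AutoJudge's actual components.

```python
import numpy as np

def judge_speculative_step(draft_tokens, target_tokens, features, classifier, tau=0.5):
    """One speculation cycle of judge-style lossy speculative decoding.

    draft_tokens / target_tokens: token ids proposed by the draft model and
    the target model's choices at the same positions.
    features: per-position feature vectors (e.g. hidden states).
    classifier: object with predict_proba(X) -> probability that a mismatch
    can be accepted without hurting final answer quality.
    Returns the number of tokens accepted in this cycle.
    """
    accepted = 0
    for d, t, feat in zip(draft_tokens, target_tokens, features):
        if d == t:
            accepted += 1            # exact match: always accept
            continue
        p_ok = classifier.predict_proba(np.asarray(feat).reshape(1, -1))[0, 1]
        if p_ok >= tau:
            accepted += 1            # mismatch judged unimportant: keep it
        else:
            break                    # important mismatch: stop, use target token
    return accepted
```

Any scikit-learn-style binary classifier trained offline on labeled mismatches
would fit the predict_proba interface assumed here.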
♻ ☆ MASLegalBench: Benchmarking Multi-Agent Systems in Deductive Legal Reasoning
Multi-agent systems (MAS), leveraging the remarkable capabilities of Large
Language Models (LLMs), show great potential in addressing complex tasks. In
this context, integrating MAS with legal tasks is a crucial step. While
previous studies have developed legal benchmarks for LLM agents, none are
specifically designed to consider the unique advantages of MAS, such as task
decomposition, agent specialization, and flexible training. In fact, the lack
of evaluation methods limits the potential of MAS in the legal domain. To
address this gap, we propose MASLegalBench, a legal benchmark tailored for MAS
and designed with a deductive reasoning approach. Our benchmark uses GDPR as
the application scenario, encompassing extensive background knowledge and
covering complex reasoning processes that effectively reflect the intricacies
of real-world legal situations. Furthermore, we manually design various
role-based MAS and conduct extensive experiments using different
state-of-the-art LLMs. Our results highlight the strengths, limitations, and
potential areas for improvement of existing models and MAS architectures.
♻ ☆ LoLA: Low-Rank Linear Attention With Sparse Caching
The per-token cost of transformer inference scales with context length,
preventing its application to lifelong in-context learning. Linear attention is
an efficient alternative that maintains a constant memory footprint, even on
infinite context lengths. While this is a potential candidate for lifelong
learning, it falls short in memory capacity. In this paper, we propose LoLA, a
training-free augmentation to linear attention that boosts associative recall.
LoLA distributes past key-value pairs from context into three memory systems:
(i) recent pairs in a local sliding window cache; (ii) difficult-to-memorize
pairs in a sparse, global cache; and (iii) generic pairs in the recurrent
hidden state of linear attention. We show through ablations that our
self-recall error metric is crucial to efficiently manage long-term associative
memories. On pass-key retrieval tasks, LoLA improves the base model's
performance from 0.6% to 97.4% accuracy. This is achieved with a 4.6x smaller
cache than Llama-3.1 8B on 4K context length. LoLA also outperforms other 1B
and 8B parameter subquadratic models on zero-shot commonsense reasoning tasks.
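The three-tier routing can be sketched with a toy scorer: recent pairs stay in
the sliding window, older pairs that the associative state recalls poorly go to
the sparse cache, and the rest are folded into the recurrent state. The recall
error and fixed budgets below are simplifying assumptions, not LoLA's exact
mechanism.

```python
import numpy as np

def route_kv_pairs(keys, values, window=64, sparse_budget=32):
    """Route past (key, value) pairs into three LoLA-style memory tiers.

    keys, values: arrays of shape (n, d).  Returns three index lists:
    the local sliding-window cache, the sparse global cache, and the pairs
    absorbed into the recurrent linear-attention state.
    """
    n, d = keys.shape
    local = list(range(max(0, n - window), n))            # most recent pairs
    older = list(range(0, max(0, n - window)))

    # associative state built from the older pairs (linear-attention style)
    state = sum(np.outer(keys[i], values[i]) for i in older) if older else np.zeros((d, d))

    # self-recall error: how badly the state reproduces each value from its key
    errors = {i: np.linalg.norm(keys[i] @ state - values[i]) for i in older}
    sparse = sorted(older, key=lambda i: errors[i], reverse=True)[:sparse_budget]
    recurrent = [i for i in older if i not in set(sparse)]
    return local, sparse, recurrent
```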
♻ ☆ GIM: Improved Interpretability for Large Language Models
Ensuring faithful interpretability in large language models is imperative for
trustworthy and reliable AI. A key obstacle is self-repair, a phenomenon where
networks compensate for reduced signal in one component by amplifying others,
masking the true importance of the ablated component. While prior work
attributes self-repair to layer normalization and back-up components that
compensate for ablated components, we identify a novel form occurring within
the attention mechanism, where softmax redistribution conceals the influence of
important attention scores. This leads traditional ablation and gradient-based
methods to underestimate the significance of all components contributing to
these attention scores. We introduce Gradient Interaction Modifications (GIM),
a technique that accounts for self-repair during backpropagation. Extensive
experiments across multiple large language models (Gemma 2B/9B, LLAMA 1B/3B/8B,
Qwen 1.5B/3B) and diverse tasks demonstrate that GIM significantly improves
faithfulness over existing circuit identification and feature attribution
methods. Our work is a significant step toward better understanding the inner
mechanisms of LLMs, which is crucial for improving them and ensuring their
safety. Our code is available at https://github.com/JoakimEdin/gim.
♻ ☆ One ruler to measure them all: Benchmarking multilingual long-context language models
We present ONERULER, a multilingual benchmark designed to evaluate
long-context language models across 26 languages. ONERULER adapts the
English-only RULER benchmark (Hsieh et al., 2024) by including seven synthetic
tasks that test both retrieval and aggregation, among them new variations of the
"needle-in-a-haystack" task that allow for the possibility of a nonexistent
needle. We create ONERULER through a two-step process, first writing English
instructions for each task and then collaborating with native speakers to
translate them into 25 additional languages. Experiments with both open-weight
and closed LLMs reveal a widening performance gap between low- and
high-resource languages as context length increases from 8K to 128K tokens.
Surprisingly, English is not the top-performing language on long-context tasks
(ranked 6th out of 26), with Polish emerging as the top language. Our
experiments also show that many LLMs (particularly OpenAI's o3-mini-high)
incorrectly predict the absence of an answer, even in high-resource languages.
Finally, in cross-lingual scenarios where instructions and context appear in
different languages, performance can fluctuate by up to 20% depending on the
instruction language. We hope the release of ONERULER will facilitate future
research into improving multilingual and cross-lingual long-context training
pipelines.
♻ ☆ Mind the Gap: A Review of Arabic Post-Training Datasets and Their Limitations
Mohammed Alkhowaiter, Norah Alshahrani, Saied Alshahrani, Reem I. Masoud, Alaa Alzahrani, Deema Alnuhait, Emad A. Alghamdi, Khalid Almubarak
Post-training has emerged as a crucial technique for aligning pre-trained
Large Language Models (LLMs) with human instructions, significantly enhancing
their performance across a wide range of tasks. Central to this process is the
quality and diversity of post-training datasets. This paper presents a review
of publicly available Arabic post-training datasets on the Hugging Face Hub,
organized along four key dimensions: (1) LLM Capabilities (e.g., Question
Answering, Translation, Reasoning, Summarization, Dialogue, Code Generation,
and Function Calling); (2) Steerability (e.g., Persona and System Prompts); (3)
Alignment (e.g., Cultural, Safety, Ethics, and Fairness); and (4) Robustness.
Each dataset is rigorously evaluated based on popularity, practical adoption,
recency and maintenance, documentation and annotation quality, licensing
transparency, and scientific contribution. Our review revealed critical gaps in
the development of Arabic post-training datasets, including limited task
diversity, inconsistent or missing documentation and annotation, and low
adoption across the community. Finally, the paper discusses the implications of
these gaps on the progress of Arabic-centric LLMs and applications while
providing concrete recommendations for future efforts in Arabic post-training
dataset development.
♻ ☆ Cut the Deadwood Out: Backdoor Purification via Guided Module Substitution EMNLP 2025
NLP models are commonly trained (or fine-tuned) on datasets from
untrusted platforms like HuggingFace, posing significant risks of data
poisoning attacks. A practical yet underexplored challenge arises when such
backdoors are discovered after model deployment, making retraining-required
defenses less desirable due to computational costs and data constraints. In
this work, we propose Guided Module Substitution (GMS), an effective
retraining-free method based on guided merging of the victim model with just a
single proxy model. Unlike prior ad-hoc merging defenses, GMS uses a guided
trade-off signal between utility and backdoor to selectively replace modules
in the victim model. GMS offers four desirable properties: (1) robustness to
the choice and trustworthiness of the proxy model, (2) applicability under
inaccurate data knowledge, (3) stability across hyperparameters, and (4)
transferability across different attacks. Extensive experiments on encoder
models and decoder LLMs demonstrate the strong effectiveness of GMS. GMS
significantly outperforms even the strongest defense baseline, particularly
against challenging attacks like LWS.
comment: Accepted to Findings of the Association for Computational
Linguistics: EMNLP 2025
♻ ☆ Medical Question Summarization with Entity-driven Contrastive Learning
By summarizing longer consumer health questions into shorter and essential
ones, medical question-answering systems can more accurately understand
consumer intentions and retrieve suitable answers. However, medical question
summarization is very challenging due to obvious distinctions in health trouble
descriptions from patients and doctors. Although deep learning has been applied
to successfully address the medical question summarization (MQS) task, two
challenges remain: how to correctly capture question focus to model its
semantic intention, and how to obtain reliable datasets to fairly evaluate
performance. To address these challenges, this paper proposes a novel medical
question summarization framework based on entity-driven contrastive learning
(ECL). ECL employs medical entities present in frequently asked questions
(FAQs) as focuses and devises an effective mechanism to generate hard negative
samples. This approach compels models to focus on essential information and
consequently generate more accurate question summaries. Furthermore, we have
discovered that some MQS datasets, such as the iCliniq dataset with a 33%
duplicate rate, have significant data leakage issues. To ensure an impartial
evaluation of the related methods, this paper carefully examines leaked samples
to reorganize more reasonable datasets. Extensive experiments demonstrate that
our ECL method outperforms the existing methods and achieves new
state-of-the-art performance, i.e., ROUGE-1 scores of 52.85, 43.16, 41.31, and
43.52 on the MeQSum, CHQ-Summ, iCliniq, and HealthCareMagic datasets,
respectively. The code and datasets are available at
https://github.com/yrbobo/MQS-ECL.
♻ ☆ A quantitative analysis of semantic information in deep representations of text and images
Deep neural networks are known to develop similar representations for
semantically related data, even when they belong to different domains, such as
an image and its description, or the same text in different languages. We
present a method for quantitatively investigating this phenomenon by measuring
the relative information content of the representations of semantically related
data and probing how it is encoded into multiple tokens of large language
models (LLMs) and vision transformers. Looking first at how LLMs process pairs
of translated sentences, we identify inner ``semantic'' layers containing the
most language-transferable information. We find moreover that, on these layers,
a larger LLM (DeepSeek-V3) extracts significantly more general information than
a smaller one (Llama3.1-8B). Semantic information of English text is spread
across many tokens and it is characterized by long-distance correlations
between tokens and by a causal left-to-right (i.e., past-future) asymmetry. We
also identify layers encoding semantic information within vision transformers.
We show that caption representations in the semantic layers of LLMs predict
visual representations of the corresponding images. We observe significant and
model-dependent information asymmetries between image and text representations.
♻ ☆ Scalable LLM Math Reasoning Acceleration with Low-rank Distillation
Due to long generations, large language model (LLM) math reasoning demands
significant computational resources and time. While many existing efficient
inference methods have been developed with excellent performance preservation
on language tasks, they often severely degrade math performance. In this paper,
we propose Caprese, a resource-efficient distillation method to recover
capabilities lost when deploying efficient inference methods, focusing primarily
on feedforward blocks. With original weights unperturbed, roughly 1% of additional
parameters, and only 20K synthetic training samples, we are able to recover
much if not all of the math capabilities lost from efficient inference for
thinking LLMs and without harm to language tasks for instruct LLMs. Moreover,
Caprese slashes the number of active parameters (~2B cut for Gemma 2 9B and
Llama 3.1 8B) and integrates cleanly into existing model layers to reduce
latency (>16% time-to-next-token reduction) while encouraging response brevity
(up to 8.5% fewer tokens).
♻ ☆ Structured Agent Distillation for Large Language Model
Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Pu Zhao, Xue Lin, Dong Huang, Yanzhi Wang
Large language models (LLMs) exhibit strong capabilities as decision-making
agents by interleaving reasoning and actions, as seen in ReAct-style
frameworks. Yet, their practical deployment is constrained by high inference
costs and large model sizes. We propose Structured Agent Distillation, a
framework that compresses large LLM-based agents into smaller student models
while preserving both reasoning fidelity and action consistency. Unlike
standard token-level distillation, our method segments trajectories into
[REASON] and [ACT] spans, applying segment-specific losses to align each
component with the teacher's behavior. This structure-aware supervision enables
compact agents to better replicate the teacher's decision process. Experiments
on ALFWorld, HotPotQA-ReAct, and WebShop show that our approach consistently
outperforms token-level and imitation learning baselines, achieving significant
compression with minimal performance drop. Scaling and ablation results further
highlight the importance of span-level alignment for efficient and deployable
agents.
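The segment-specific supervision can be illustrated with a helper that turns a
marked trajectory into per-token loss weights; the marker format and the
weights are assumptions made for illustration.

```python
def segment_loss_weights(tokens, reason_weight=1.0, act_weight=2.0):
    """Assign per-token loss weights to [REASON] and [ACT] spans.

    tokens: trajectory tokens in which segments are delimited by the literal
    markers "[REASON]" and "[ACT]".  Tokens in each span receive the weight of
    that span so reasoning and action segments can be supervised separately.
    """
    weights, current = [], 0.0
    for tok in tokens:
        if tok == "[REASON]":
            current = reason_weight
        elif tok == "[ACT]":
            current = act_weight
        weights.append(0.0 if tok in ("[REASON]", "[ACT]") else current)
    return weights


traj = ["[REASON]", "the", "door", "is", "locked", "[ACT]", "use", "key", "on", "door"]
print(segment_loss_weights(traj))
# [0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0, 2.0, 2.0, 2.0]
```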
♻ ☆ Value Profiles for Encoding Human Variation EMNLP 2025
Taylor Sorensen, Pushkar Mishra, Roma Patel, Michael Henry Tessler, Michiel Bakker, Georgina Evans, Iason Gabriel, Noah Goodman, Verena Rieser
Modelling human variation in rating tasks is crucial for personalization,
pluralistic model alignment, and computational social science. We propose
representing individuals using natural language value profiles -- descriptions
of underlying values compressed from in-context demonstrations -- along with a
steerable decoder model that estimates individual ratings from a rater
representation. To measure the predictive information in a rater
representation, we introduce an information-theoretic methodology and find that
demonstrations contain the most information, followed by value profiles, then
demographics. However, value profiles effectively compress the useful
information from demonstrations (>70% information preservation) and offer
advantages in terms of scrutability, interpretability, and steerability.
Furthermore, clustering value profiles to identify similarly behaving
individuals better explains rater variation than the most predictive
demographic groupings. Going beyond test set performance, we show that the
decoder predictions change in line with semantic profile differences, are
well-calibrated, and can help explain instance-level disagreement by simulating
an annotator population. These results demonstrate that value profiles offer
novel, predictive ways to describe individual variation beyond demographics or
group information.
comment: EMNLP 2025
♻ ☆ Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models NeurIPS 2025
Existing large language models (LLMs) face challenges of following complex
instructions, especially when multiple constraints are present and organized in
paralleling, chaining, and branching structures. One intuitive solution, namely
chain-of-thought (CoT), is expected to universally improve capabilities of
LLMs. However, we find that the vanilla CoT exerts a negative impact on
performance due to its superficial reasoning pattern of simply paraphrasing the
instructions. It fails to decompose the constraints and identify their
relationships across hierarchies of types and dimensions. To
this end, we propose RAIF, a systematic method to boost LLMs in dealing with
complex instructions via incentivizing reasoning for test-time compute scaling.
First, we stem from the decomposition of complex instructions under existing
taxonomies and propose a reproducible data acquisition method. Second, we
exploit reinforcement learning (RL) with verifiable rule-centric reward signals
to cultivate reasoning specifically for instruction following. We address the
shallow, non-essential nature of reasoning under complex instructions via
sample-wise contrast for superior CoT enforcement. We also exploit behavior
cloning of experts to facilitate steady distribution shift from fast-thinking
LLMs to skillful reasoners. Extensive evaluations on seven comprehensive
benchmarks confirm the validity of the proposed method, where a 1.5B LLM
achieves 11.74% gains with performance comparable to an 8B LLM. Evaluation on
OOD constraints also confirms the generalizability of our RAIF. Codes and data
are available at https://github.com/yuleiqin/RAIF.
Keywords: reinforcement learning with verifiable rewards (RLVR), instruction
following, complex instructions
comment: Accepted to NeurIPS 2025; 15 pages of main body, 5 tables, 5 figures,
42 pages of appendix
♻ ☆ SelfReflect: Can LLMs Communicate Their Internal Answer Distribution?
Michael Kirchhof, Luca Füger, Adam Goliński, Eeshan Gunesh Dhekane, Arno Blaas, Seong Joon Oh, Sinead Williamson
The common approach to communicate a large language model's (LLM) uncertainty
is to add a percentage number or a hedging word to its response. But is this
all we can do? Instead of generating a single answer and then hedging it, an
LLM that is fully transparent to the user needs to be able to reflect on its
internal belief distribution and output a summary of all options it deems
possible, and how likely they are. To test whether LLMs possess this
capability, we develop the SelfReflect metric, an information-theoretic
distance between a given summary and a distribution over answers. In
interventional and human studies, we find that SelfReflect indicates even
slight deviations, yielding a fine measure of faithfulness between a summary
string and an LLM's actual internal distribution over answers. With
SelfReflect, we make a resounding negative observation: modern LLMs are, across
the board, incapable of revealing what they are uncertain about, neither
through reasoning, nor chains-of-thoughts, nor explicit finetuning. However, we
do find that LLMs are able to generate faithful summaries of their
uncertainties if we help them by sampling multiple outputs and feeding them
back into the context. This simple approach points toward a general way of
communicating LLM uncertainties, whose future development the SelfReflect score
enables.
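The sample-then-summarize remedy reported above can be sketched in a few lines;
the llm callable and the prompts are hypothetical stand-ins, not an interface
from the paper.

```python
def summarize_uncertainty(llm, question, n_samples=10, temperature=1.0):
    """Ask an LLM to describe its own answer distribution via multi-sampling.

    llm: callable llm(prompt, temperature) -> string (hypothetical interface).
    Draws several independent answers, feeds them back into the context, and
    asks for a summary of all options the model considers possible.
    """
    samples = [llm(question, temperature=temperature) for _ in range(n_samples)]
    listing = "\n".join(f"- {s}" for s in samples)
    summary_prompt = (
        f"Question: {question}\n"
        f"Here are {n_samples} answers you produced independently:\n{listing}\n"
        "Summarize all distinct answer options above and roughly how often "
        "each occurs."
    )
    return llm(summary_prompt, temperature=0.0)
```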
♻ ☆ Frankentext: Stitching random text fragments into long-form narratives
We introduce Frankentexts, a long-form narrative generation paradigm that
treats an LLM as a composer of existing texts rather than as an author. Given a
writing prompt and thousands of randomly sampled human-written snippets, the
model is asked to produce a narrative under the extreme constraint that most
tokens (e.g., 90%) must be copied verbatim from the provided paragraphs. This
task is effectively intractable for humans: selecting and ordering snippets
yields a combinatorial search space that an LLM implicitly explores, before
minimally editing and stitching together selected fragments into a coherent
long-form story. Despite the extreme challenge of the task, we observe through
extensive automatic and human evaluation that Frankentexts significantly
improve over vanilla LLM generations in terms of writing quality, diversity,
and originality while remaining coherent and relevant to the prompt.
Furthermore, Frankentexts pose a fundamental challenge to detectors of
AI-generated text: 72% of Frankentexts produced by our best Gemini 2.5 Pro
configuration are misclassified as human-written by Pangram, a state-of-the-art
detector. Human annotators praise Frankentexts for their inventive premises,
vivid descriptions, and dry humor; on the other hand, they identify issues with
abrupt tonal shifts and uneven grammar across segments, particularly in longer
pieces. The emergence of high-quality Frankentexts raises serious questions
about authorship and copyright: when humans provide the raw materials and LLMs
orchestrate them into new narratives, who truly owns the result?
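The verbatim-copy constraint can be checked approximately with an n-gram
coverage counter; whitespace tokenization and the n-gram length below are
simplifying assumptions, not the paper's measurement.

```python
def verbatim_copy_rate(narrative, snippets, n=8):
    """Estimate the fraction of a narrative copied verbatim from snippets.

    Counts the share of narrative tokens covered by some n-gram that also
    appears in one of the human-written snippets, a rough proxy for a
    "90% of tokens copied" constraint.
    """
    def ngrams(tokens, size):
        return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}

    source_ngrams = set()
    for snippet in snippets:
        source_ngrams |= ngrams(snippet.split(), n)

    tokens = narrative.split()
    covered = [False] * len(tokens)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) in source_ngrams:
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / max(len(tokens), 1)
```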
♻ ☆ AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models
Kai Li, Can Shen, Yile Liu, Jirui Han, Kelong Zheng, Xuechao Zou, Zhe Wang, Shun Zhang, Xingjian Du, Hanjun Luo, Yingbin Jin, Xinxin Xing, Ziyang Ma, Yue Liu, Yifan Zhang, Junfeng Fang, Kun Wang, Yibo Yan, Gelei Deng, Haoyang Li, Yiming Li, Xiaobin Zhuang, Tianlong Chen, Qingsong Wen, Tianwei Zhang, Yang Liu, Haibo Hu, Zhizheng Wu, Xiaolin Hu, Eng-Siong Chng, Wenyuan Xu, XiaoFeng Wang, Wei Dong, Xinfeng Li
Audio Large Language Models (ALLMs) have gained widespread adoption, yet
their trustworthiness remains underexplored. Existing evaluation frameworks,
designed primarily for text, fail to address unique vulnerabilities introduced
by audio's acoustic properties. We identify significant trustworthiness risks
in ALLMs arising from non-semantic acoustic cues, including timbre, accent, and
background noise, which can manipulate model behavior. We propose AudioTrust, a
comprehensive framework for systematic evaluation of ALLM trustworthiness
across audio-specific risks. AudioTrust encompasses six key dimensions:
fairness, hallucination, safety, privacy, robustness, and authentication. The
framework implements 26 distinct sub-tasks using a curated dataset of over
4,420 audio samples from real-world scenarios, including daily conversations,
emergency calls, and voice assistant interactions. We conduct comprehensive
evaluations across 18 experimental configurations using human-validated
automated pipelines. Our evaluation of 14 state-of-the-art open-source and
closed-source ALLMs reveals significant limitations when confronted with
diverse high-risk audio scenarios, providing insights for secure deployment of
audio models. Code and data are available at
https://github.com/JusperLee/AudioTrust.
comment: Technical Report
♻ ☆ Should You Use Your Large Language Model to Explore or Exploit?
We evaluate the ability of the current generation of large language models
(LLMs) to help a decision-making agent facing an exploration-exploitation
tradeoff. We use LLMs to explore and exploit in silos in various (contextual)
bandit tasks. We find that while the current LLMs often struggle to exploit,
in-context mitigations may be used to substantially improve performance for
small-scale tasks. However, even then, LLMs perform worse than a simple linear
regression. On the other hand, we find that LLMs do help at exploring large
action spaces with inherent semantics, by suggesting suitable candidates to
explore.
♻ ☆ Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions EMNLP 2025
Large language models, despite extensive alignment with human values and
ethical principles, remain vulnerable to sophisticated jailbreak attacks that
exploit their reasoning abilities. Existing safety measures often detect overt
malicious intent but fail to address subtle, reasoning-driven vulnerabilities.
In this work, we introduce POATE (Polar Opposite query generation, Adversarial
Template construction, and Elaboration), a novel jailbreak technique that
harnesses contrastive reasoning to provoke unethical responses. POATE crafts
semantically opposing intents and integrates them with adversarial templates,
steering models toward harmful outputs with remarkable subtlety. We conduct
extensive evaluation across six diverse language model families of varying
parameter sizes to demonstrate the robustness of the attack, achieving
significantly higher attack success rates (~44%) compared to existing methods.
To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which
decompose queries to detect malicious intent and reason in reverse to evaluate
and reject harmful responses. These methods enhance reasoning robustness and
strengthen the model's defense against adversarial exploits.
comment: Accepted at EMNLP 2025 (Main)
♻ ☆ Scaling RL to Long Videos NeurIPS 2025
Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han
We introduce a full-stack framework that scales up reasoning in
vision-language models (VLMs) to long videos, leveraging reinforcement
learning. We address the unique challenges of long video reasoning by
integrating three critical components: (1) a large-scale dataset,
LongVideo-Reason, comprising 104K long video QA pairs with high-quality
reasoning annotations across diverse domains such as sports, games, and vlogs;
(2) a two-stage training pipeline that extends VLMs with chain-of-thought
supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a
training infrastructure for long video RL, named Multi-modal Reinforcement
Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a
vLLM-based engine tailored for long video, using cached video embeddings for
efficient rollout and prefilling. In our experiments, LongVILA-R1-7B achieves
strong performance on video benchmarks, reaching 65.1% and 71.1% accuracy on
VideoMME without and with subtitles, respectively, and consistently
outperforming LongVILA-7B across multiple benchmarks. Moreover, LongVILA-R1-7B
supports processing up to 8,192 video frames per video, and configurable FPS
settings. Notably, our MR-SP system achieves up to 2.1x speedup on long video
RL training. In addition, we publicly release our training system, which
supports RL training on various modalities (video, text, and
audio), various models (VILA and Qwen series), and even image and video
generation models. On a single A100 node (8 GPUs), it supports RL training on
hour-long videos (e.g., 3,600 frames).
comment: Accepted by NeurIPS 2025. Code at https://github.com/NVlabs/Long-RL
and model at https://huggingface.co/Efficient-Large-Model/LongVILA-R1-7B
♻ ☆ CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning
Reinforcement learning (RL) has become a powerful paradigm for optimizing
large language models (LLMs) to handle complex reasoning tasks. A core
challenge in this process lies in managing policy entropy, which reflects the
balance between exploration and exploitation during training. Existing methods,
such as proximal policy optimization (PPO) and its variants, discard valuable
gradient signals from low-probability tokens due to the clipping mechanism. We
systematically analyze the entropy dynamics and reveal that these clipped
tokens play a critical yet overlooked role in regulating entropy evolution. We
propose \textbf{C}oordinating \textbf{E}ntropy via
\textbf{G}radient-\textbf{P}reserving \textbf{P}olicy \textbf{O}ptimization
(CE-GPPO), a novel algorithm that reintroduces gradients from clipped tokens in
native PPO in a gentle and bounded manner. By controlling the magnitude of
gradients from tokens outside the clipping interval, CE-GPPO is able to achieve
an exploration-exploitation trade-off. We provide theoretical justification and
empirical evidence showing that CE-GPPO effectively mitigates entropy
instability. Extensive experiments on mathematical reasoning benchmarks show
that CE-GPPO consistently outperforms strong baselines across different model
scales.
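One way to picture gradient-preserving clipping is a PPO-style per-token loss
in which tokens falling outside the clip interval contribute a small, bounded
auxiliary term instead of a zero gradient. The mask and the coefficient beta
below are illustrative assumptions, not CE-GPPO's exact formulation.

```python
import torch

def gradient_preserving_clip_loss(logp_new, logp_old, advantages, eps=0.2, beta=0.05):
    """PPO-style token loss that keeps a bounded gradient for clipped tokens.

    Standard PPO zeroes the gradient of tokens whose importance ratio lands on
    the clipped side of the min; here those tokens also contribute a small
    auxiliary surrogate scaled by beta, so their gradient is reintroduced in a
    bounded way.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    ppo_obj = torch.minimum(unclipped, clipped)

    # tokens whose gradient vanilla PPO would discard
    outside = (unclipped > clipped).float()
    aux_obj = outside * beta * unclipped      # gentle, bounded reintroduction

    return -(ppo_obj + aux_obj).mean()
```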
♻ ☆ BianCang: A Traditional Chinese Medicine Large Language Model
Sibo Wei, Xueping Peng, Yi-Fei Wang, Tao Shen, Jiasheng Si, Weiyu Zhang, Fa Zhu, Athanasios V. Vasilakos, Wenpeng Lu, Xiaoming Wu, Yinglong Wang
The surge of large language models (LLMs) has driven significant progress in
medical applications, including traditional Chinese medicine (TCM). However,
current medical LLMs struggle with TCM diagnosis and syndrome differentiation
due to substantial differences between TCM and modern medical theory, and the
scarcity of specialized, high-quality corpora. To this end, in this paper we
propose BianCang, a TCM-specific LLM, using a two-stage training process that
first injects domain-specific knowledge and then aligns it through targeted
stimulation to enhance diagnostic and differentiation capabilities.
Specifically, we constructed pre-training corpora, instruction-aligned datasets
based on real hospital records, and the ChP-TCM dataset derived from the
Pharmacopoeia of the People's Republic of China. We compiled extensive TCM and
medical corpora for continual pre-training and supervised fine-tuning, building
a comprehensive dataset to refine the model's understanding of TCM. Evaluations
across 11 test sets involving 31 models and 4 tasks demonstrate the
effectiveness of BianCang, offering valuable insights for future research.
Code, datasets, and models are available on
https://github.com/QLU-NLP/BianCang.
♻ ☆ Value-Guided Search for Efficient Chain-of-Thought Reasoning NeurIPS 2025
In this paper, we propose a simple and efficient method for value model
training on long-context reasoning traces. Compared to existing process reward
models (PRMs), our method does not require a fine-grained notion of "step,"
which is difficult to define for long-context reasoning models. By collecting a
dataset of 2.5 million reasoning traces, we train a 1.5B token-level value
model and apply it to DeepSeek models for improved performance with test-time
compute scaling. We find that block-wise value-guided search (VGS) with a final
weighted majority vote achieves better test-time scaling than standard methods
such as majority voting or best-of-n. Moreover, VGS significantly reduces the
inference FLOPs required to achieve the same performance of majority voting.
Our dataset, model and codebase are open-sourced.
comment: NeurIPS 2025
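Block-wise value-guided search with a final weighted majority vote can be
sketched as follows; the generator, value-model, and answer-extraction
interfaces are assumptions for illustration.

```python
from collections import defaultdict

def value_guided_search(generate_block, value_fn, extract_answer,
                        prompt, beam=4, expand=4, max_blocks=16):
    """Block-wise search guided by a value model, ending in a weighted vote.

    generate_block(prefix) -> (next text block, done flag)
    value_fn(text) -> scalar score from a token-level value model
    extract_answer(text) -> final answer string, or None if absent
    """
    beams = [prompt] * beam
    finished = []                              # (full text, value) pairs
    for _ in range(max_blocks):
        candidates = []
        for prefix in beams:
            for _ in range(expand):
                block, done = generate_block(prefix)
                text = prefix + block
                score = value_fn(text)
                (finished if done else candidates).append((text, score))
        if not candidates:
            break
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = [text for text, _ in candidates[:beam]]

    votes = defaultdict(float)                 # value-weighted majority vote
    for text, score in finished:
        answer = extract_answer(text)
        if answer is not None:
            votes[answer] += score
    return max(votes, key=votes.get) if votes else None
```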
♻ ☆ Preemptive Detection and Correction of Misaligned Actions in LLM Agents EMNLP 2025
Deploying LLM-based agents in real-life applications often faces a critical
challenge: the misalignment between agents' behavior and user intent. Such
misalignment may lead agents to unintentionally execute critical actions that
carry negative outcomes (e.g., accidentally triggering a "buy-now" in web
shopping), resulting in undesirable or even irreversible consequences. Although
addressing these issues is crucial, the preemptive detection and correction of
misaligned actions remains relatively underexplored. To fill this gap, we
introduce InferAct, a novel approach that leverages the belief reasoning
ability of LLMs, grounded in Theory-of-Mind, to detect misaligned actions
before execution. Once the misalignment is detected, InferAct alerts users for
timely correction, preventing adverse outcomes and enhancing the reliability of
LLM agents' decision-making processes. Experiments on three widely used tasks
demonstrate that InferAct achieves up to 20% improvements on Marco-F1 against
baselines in misaligned action detection. An in-depth evaluation of
misalignment correction further highlights InferAct's effectiveness in
improving agent alignment.
comment: Accepted by EMNLP 2025
♻ ☆ Voting or Consensus? Decision-Making in Multi-Agent Debate ACL2025
Much of the success of multi-agent debates depends on carefully choosing the
right parameters. The decision-making protocol stands out as it can highly
impact final model answers, depending on how decisions are reached. Systematic
comparison of decision protocols is difficult because many studies alter
multiple discussion parameters beyond the protocol. So far, it has been largely
unknown how decision-making influences different tasks. This work
systematically evaluates the impact of seven decision protocols (e.g., majority
voting, unanimity consensus). We change only one variable at a time - the
decision protocol - to analyze how different methods affect the collaboration
between agents and measure differences in knowledge and reasoning tasks. Our
results show that voting protocols improve performance by 13.2% in reasoning
tasks and consensus protocols by 2.8% in knowledge tasks compared to other
decision protocols. Increasing the number of agents improves performance, while
more discussion rounds before voting reduce it. To improve decision-making by
increasing answer diversity, we propose two new methods, All-Agents Drafting
(AAD) and Collective Improvement (CI). Our methods improve task performance by
up to 3.3% with AAD and up to 7.4% with CI. This work demonstrates the
importance of decision-making in multi-agent debates beyond scaling.
comment: Accepted at ACL2025 (Findings)
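The contrast between the two protocol families can be made concrete with a
small sketch; the agent interface and the round limit are illustrative
assumptions.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer among agents (ties broken arbitrarily)."""
    return Counter(answers).most_common(1)[0][0]

def unanimity_consensus(agents, question, max_rounds=3):
    """Debate until every agent gives the same answer, or fall back to voting.

    agents: list of callables agent(question, transcript) -> answer string.
    Each round exposes the previous answers so agents can revise their views.
    """
    transcript = []
    answers = []
    for _ in range(max_rounds):
        answers = [agent(question, transcript) for agent in agents]
        if len(set(answers)) == 1:
            return answers[0]
        transcript.append(answers)
    return majority_vote(answers) if answers else None
```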
♻ ☆ Efficient Context Selection for Long-Context QA: No Tuning, No Iteration, Just Adaptive-$k$ EMNLP 2025
Retrieval-augmented generation (RAG) and long-context language models (LCLMs)
both address context limitations of LLMs in open-domain question answering
(QA). However, how much external context to retrieve remains an open problem:
fixing the retrieval size risks either wasting tokens or omitting key evidence.
Existing adaptive methods like Self-RAG and Self-Route rely on iterative LLM
prompting and perform well on factoid QA, but struggle with aggregation QA,
where the optimal context size is both unknown and variable. We present
Adaptive-$k$ retrieval, a simple and effective single-pass method that
adaptively selects the number of passages based on the distribution of the
similarity scores between the query and the candidate passages. It does not
require model fine-tuning, extra LLM inferences or changes to existing
retriever-reader pipelines. On both factoid and aggregation QA benchmarks,
Adaptive-$k$ matches or outperforms fixed-$k$ baselines while using up to 10x
fewer tokens than full-context input, yet still retrieves 70% of relevant
passages. It improves accuracy across five LCLMs and two embedding models,
highlighting that dynamically adjusting context size leads to more efficient
and accurate QA.
comment: 26 pages, 16 tables, 5 figures. Accepted at EMNLP 2025 (Main)
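Cutting at the largest drop in the sorted similarity scores is one plausible
single-pass way to act on the score distribution; the sketch below is an
assumption-labeled illustration, not necessarily the paper's exact rule.

```python
import numpy as np

def adaptive_k(scores, k_min=1, k_max=50):
    """Choose how many passages to keep from query-passage similarity scores.

    Sorts the scores in descending order and cuts at the largest gap between
    consecutive scores, constrained to [k_min, k_max].  Returns the indices of
    the selected passages.
    """
    order = np.argsort(scores)[::-1]                    # best passages first
    sorted_scores = np.asarray(scores)[order]
    upper = min(k_max, len(sorted_scores) - 1)
    if upper < k_min:
        return order.tolist()                           # too few passages to cut
    gaps = sorted_scores[k_min - 1:upper] - sorted_scores[k_min:upper + 1]
    k = k_min + int(np.argmax(gaps))
    return order[:k].tolist()


print(adaptive_k([0.91, 0.15, 0.88, 0.12, 0.85, 0.10]))  # -> [0, 2, 4]
```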
♻ ☆ Towards Reasoning Ability of Small Language Models EMNLP 2025
Reasoning has long been viewed as an emergent property of large language
models (LLMs). However, recent studies challenge this assumption, showing that
small language models (SLMs) can also achieve competitive reasoning
performance. This paper introduces ThinkSLM, the first extensive benchmark to
systematically evaluate and study the reasoning abilities of SLMs trained from
scratch or derived from LLMs through quantization, pruning, and distillation.
We first establish a reliable evaluation criterion comparing available methods
and LLM judges against our human evaluations. Then we present a study
evaluating 72 diverse SLMs from six major model families across 17 reasoning
benchmarks. We repeat all our experiments three times to ensure a robust
assessment. Our findings show that: 1) reasoning ability in SLMs is strongly
influenced by training methods and data quality rather than solely model scale;
2) quantization preserves reasoning capability, while pruning significantly
disrupts it; 3) larger models consistently exhibit higher robustness against
adversarial perturbations and intermediate reasoning, but certain smaller
models closely match or exceed the larger models' performance. Our findings
challenge the assumption that scaling is the only way to achieve strong
reasoning. Instead, we foresee a future where SLMs with strong reasoning
capabilities can be developed through structured training or post-training
compression. Our ThinkSLM Leaderboard is publicly available at:
https://ctrl-gaurav.github.io/thinkslm.github.io/
comment: Accepted to EMNLP 2025 Main Conference
♻ ☆ SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions
Humans continuously infer the states, goals, and behaviors of others by
perceiving their surroundings in dynamic, real-world social interactions.
However, most Theory of Mind (ToM) benchmarks only evaluate static, text-based
scenarios, which have a significant gap compared to real interactions. We
propose the SoMi-ToM benchmark, designed to evaluate multi-perspective ToM in
embodied multi-agent complex social interactions. This benchmark is based on
rich multimodal interaction data generated by the interaction environment SoMi,
covering diverse crafting goals and social relationships. Our framework
supports multi-level evaluation: (1) first-person evaluation provides
multimodal (visual, dialogue, action, etc.) input from a first-person
perspective during a task for real-time state inference, (2) third-person
evaluation provides complete third-person perspective video and text records
after a task for goal and behavior inference. This evaluation method allows for
a more comprehensive examination of a model's ToM capabilities from both the
subjective immediate experience and the objective global observation. We
constructed a challenging dataset containing 35 third-person perspective
videos, 363 first-person perspective images, and 1225 expert-annotated
multiple-choice questions (three options). On this dataset, we systematically
evaluated the performance of human subjects and several state-of-the-art large
vision-language models (LVLMs). The results show that LVLMs perform
significantly worse than humans on SoMi-ToM: the average accuracy gap between
humans and models is 40.1% in first-person evaluation and 26.4% in third-person
evaluation. This indicates that future LVLMs need to further improve their ToM
capabilities in embodied, complex social interactions.
comment: 24 pages, 6 figures
♻ ☆ Agentar-Scale-SQL: Advancing Text-to-SQL through Orchestrated Test-Time Scaling
Pengfei Wang, Baolin Sun, Xuemei Dong, Yaxun Dai, Hongwei Yuan, Mengdie Chu, Yingqi Gao, Xiang Qi, Peng Zhang, Ying Yan
State-of-the-art (SOTA) Text-to-SQL methods still lag significantly behind
human experts on challenging benchmarks like BIRD. Current approaches that
explore test-time scaling lack an orchestrated strategy and neglect the model's
internal reasoning process. To bridge this gap, we introduce Agentar-Scale-SQL,
a novel framework leveraging scalable computation to improve performance.
Agentar-Scale-SQL implements an Orchestrated Test-Time Scaling strategy that
synergistically combines three distinct perspectives: i) Internal Scaling via
RL-enhanced Intrinsic Reasoning, ii) Sequential Scaling through Iterative
Refinement, and iii) Parallel Scaling using Diverse Synthesis and Tournament
Selection. Agentar-Scale-SQL is a general-purpose framework designed for easy
adaptation to new databases and more powerful language models. Extensive
experiments show that Agentar-Scale-SQL achieves SOTA performance on the BIRD
benchmark, reaching 81.67% execution accuracy on the test set and ranking
first on the official leaderboard, demonstrating an effective path toward
human-level performance.
♻ ☆ DEBATE, TRAIN, EVOLVE: Self Evolution of Language Model Reasoning EMNLP 2025
Large language models (LLMs) have improved significantly in their reasoning
through extensive training on massive datasets. However, relying solely on
additional data for improvement is becoming increasingly impractical,
highlighting the need for models to autonomously enhance their reasoning
without external supervision. In this paper, we propose Debate, Train, Evolve
(DTE), a novel ground truth-free training framework that uses multi-agent
debate traces to evolve a single language model. We also introduce a new
prompting strategy, Reflect-Critique-Refine, to improve debate quality by
explicitly instructing agents to critique and refine their reasoning. Extensive
evaluations on seven reasoning benchmarks with six open-weight models show that
our DTE framework achieves substantial improvements, with an average accuracy
gain of 8.92% on the challenging GSM-PLUS dataset. Furthermore, we observe
strong cross-domain generalization, with an average accuracy gain of 5.8% on
all other benchmarks, suggesting that our method captures general reasoning
capabilities. Our framework code and trained models are publicly available at
https://github.com/ctrl-gaurav/Debate-Train-Evolve
comment: Accepted to EMNLP 2025 Main Conference
♻ ☆ ReLoop: "Seeing Twice and Thinking Backwards" via Closed-loop Training to Mitigate Hallucinations in Multimodal understanding EMNLP2025
While Multimodal Large Language Models (MLLMs) have achieved remarkable
progress in open-ended visual question answering, they remain vulnerable to
hallucinations: outputs that contradict or misrepresent input semantics, posing
a critical challenge to reliability and factual consistency. Existing methods
often rely on external verification or post-hoc
correction, lacking an internal mechanism to validate outputs directly during
training. To bridge this gap, we propose ReLoop, a unified closed-loop training
framework that encourages multimodal consistency for cross-modal understanding
in MLLMs. ReLoop adopts a ring-shaped structure that integrates three
complementary consistency feedback mechanisms, obliging MLLMs to "see twice
and think backwards". Specifically, ReLoop employs a frozen Consistency
Feedback Plugin (CFP), comprising semantic reconstruction, visual description,
and an attention supervision module for attention alignment. These components
collectively enforce semantic reversibility, visual consistency, and
interpretable attention, enabling the model to correct its outputs during
training. Extensive evaluations and analyses demonstrate the effectiveness of
ReLoop in reducing hallucination rates across multiple benchmarks, establishing
a robust method for hallucination mitigation in MLLMs. We will release our
source code and data in the camera-ready version.
comment: Accepted at EMNLP 2025
♻ ☆ Taming the Tri-Space Tension: ARC-Guided Hallucination Modeling and Control for Text-to-Image Generation
Despite remarkable progress in image quality and prompt fidelity,
text-to-image (T2I) diffusion models continue to exhibit persistent
"hallucinations", where generated content subtly or significantly diverges from
the intended prompt semantics. While often regarded as unpredictable artifacts,
we argue that these failures reflect deeper, structured misalignments within
the generative process. In this work, we propose a cognitively inspired
perspective that reinterprets hallucinations as trajectory drift within a
latent alignment space. Empirical observations reveal that generation unfolds
within a multiaxial cognitive tension field, where the model must continuously
negotiate competing demands across three critical axes: semantic coherence,
structural alignment, and knowledge grounding. We then formalize this
three-axis space as the Hallucination Tri-Space and introduce the Alignment
Risk Code (ARC): a dynamic vector representation that quantifies real-time
alignment tension during generation. The magnitude of ARC captures overall
misalignment, its direction identifies the dominant failure axis, and its
imbalance reflects tension asymmetry. Based on this formulation, we develop the
TensionModulator (TM-ARC): a lightweight controller that operates entirely in
latent space. TM-ARC monitors ARC signals and applies targeted, axis-specific
interventions during the sampling process. Extensive experiments on standard
T2I benchmarks demonstrate that our approach significantly reduces
hallucination without compromising image quality or diversity. This framework
offers a unified and interpretable approach for understanding and mitigating
generative failures in diffusion-based T2I systems.
comment: 9 pages,6 figures,7 tables
♻ ☆ Linguistic and Embedding-Based Profiling of Texts generated by Humans and Large Language Models
The rapid advancements in large language models (LLMs) have significantly
improved their ability to generate natural language, making texts generated by
LLMs increasingly indistinguishable from human-written texts. While recent
research has primarily focused on using LLMs to classify text as either
human-written or machine-generated, our study focuses on characterizing
these texts using a set of linguistic features across different linguistic
levels such as morphology, syntax, and semantics. We select a dataset of
human-written and machine-generated texts spanning 8 domains and produced by 11
different LLMs. We calculate different linguistic features such as dependency
length and emotionality, and we use them for characterizing human-written and
machine-generated texts along with different sampling strategies, repetition
controls, and model release dates. Our statistical analysis reveals that
human-written texts tend to exhibit simpler syntactic structures and more
diverse semantic content. Furthermore, we calculate the variability of our set
of features across models and domains. Both human- and machine-generated texts
show stylistic diversity across domains, with human-written texts displaying
greater variation in our features. Finally, we apply style embeddings to
further test variability among human-written and machine-generated texts.
Notably, newer models output text that is similarly variable, pointing to a
homogenization of machine-generated texts.
comment: arXiv admin note: text overlap with arXiv:2412.03025
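As a concrete illustration of one such feature, the sketch below computes mean
dependency length (the average linear distance between a token and its
syntactic head) with spaCy; the feature definition and the model name
en_core_web_sm are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: mean dependency length as a simple syntactic-complexity feature.
# Assumes spaCy and the small English model are installed:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def mean_dependency_length(text: str) -> float:
    """Average linear distance (in tokens) between each token and its head."""
    doc = nlp(text)
    distances = [abs(tok.i - tok.head.i) for tok in doc if tok.head is not tok]
    return sum(distances) / len(distances) if distances else 0.0

print(mean_dependency_length("The quick brown fox jumps over the lazy dog."))
```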
♻ ☆ Static Word Embeddings for Sentence Semantic Representation EMNLP 2025
We propose new static word embeddings optimised for sentence semantic
representation. We first extract word embeddings from a pre-trained Sentence
Transformer, and improve them with sentence-level principal component analysis,
followed by either knowledge distillation or contrastive learning. During
inference, we represent sentences by simply averaging word embeddings, which
requires little computational cost. We evaluate models on both monolingual and
cross-lingual tasks and show that our model substantially outperforms existing
static models on sentence semantic tasks, and even surpasses a basic Sentence
Transformer model (SimCSE) on a text embedding benchmark. Lastly, we perform a
variety of analyses and show that our method successfully removes word
embedding components that are not highly relevant to sentence semantics, and
adjusts the vector norms based on the influence of words on sentence semantics.
comment: 17 pages; accepted to the Main Conference of EMNLP 2025
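Since inference reduces to averaging static word vectors, a minimal sketch of
that step is shown below; the toy vocabulary, whitespace tokenizer, and random
vectors are placeholders standing in for the distilled embeddings the paper
describes.

```python
# Minimal sketch: sentence representation by averaging static word embeddings,
# plus cosine similarity between two sentences. The tiny vocabulary and random
# vectors are placeholders for the optimised embeddings described above.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
emb = {w: rng.normal(size=64) for w in vocab}          # static word vectors
unk = np.zeros(64)                                     # fallback for OOV words

def sentence_vector(sentence: str) -> np.ndarray:
    tokens = sentence.lower().split()
    vecs = [emb.get(t, unk) for t in tokens]
    return np.mean(vecs, axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

print(cosine(sentence_vector("the cat sat on the mat"),
             sentence_vector("the dog ran")))
```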
♻ ☆ LFTR: Learning-Free Token Reduction for Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) have demonstrated exceptional
success in various multimodal tasks, yet their deployment is frequently limited
by substantial computational demands and prolonged inference times. Because the
vision modality typically contains more comprehensive information than the text
modality, its encoded representations comprise an extensive number of tokens,
leading to significant computational overhead due to the quadratic complexity of
the attention mechanism. Current token reduction methods are typically restricted
to specific model architectures and often necessitate extensive retraining or
fine-tuning, limiting their
applicability to many state-of-the-art models. In this paper, we introduce a
learning-free token reduction (LFTR) method designed for MLLMs. LFTR can be
seamlessly integrated into most open-source MLLM architectures without
requiring additional fine-tuning. By capitalizing on the redundancy in visual
representations, our approach effectively reduces tokens while preserving the
general inference performance of MLLMs. We conduct experiments on multiple MLLM
architectures (LLaVA, MiniGPT, QwenVL), and our results show that LFTR achieves
up to a $16\times$ reduction of visual tokens while maintaining or even
enhancing performance on mainstream vision question-answering benchmarks, all
in a learning-free setting. Additionally, LFTR is complementary to other
acceleration techniques, such as vision encoder compression and post-training
quantization, further promoting the efficient deployment of MLLMs. Our project
is available at https://anonymous.4open.science/r/LFTR-AAAI-0528.
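The abstract does not spell out the reduction criterion, so the sketch below is
only one plausible learning-free scheme: greedily dropping visual tokens that
are highly similar to a token already kept, measured by cosine similarity. The
function name, threshold, and shapes are hypothetical, not LFTR's actual rule.

```python
# Hypothetical learning-free visual-token reduction: keep a token only if it is
# not too similar (cosine) to any token kept so far. This is an illustrative
# redundancy criterion, not the specific mechanism used by LFTR.
import torch

def reduce_visual_tokens(tokens: torch.Tensor, sim_threshold: float = 0.9) -> torch.Tensor:
    """tokens: (num_tokens, dim) visual features; returns the kept subset."""
    normed = torch.nn.functional.normalize(tokens, dim=-1)
    kept = [0]                                   # always keep the first token
    for i in range(1, tokens.size(0)):
        sims = normed[i] @ normed[kept].T        # similarity to all kept tokens
        if sims.max() < sim_threshold:
            kept.append(i)
    return tokens[kept]

vis = torch.randn(576, 1024)                     # e.g. a ViT patch-token grid
print(reduce_visual_tokens(vis).shape)
```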
♻ ☆ A Position Paper on the Automatic Generation of Machine Learning Leaderboards
An important task in machine learning (ML) research is comparing prior work,
which is often performed via ML leaderboards: a tabular overview of experiments
with comparable conditions (e.g., same task, dataset, and metric). However, the
growing volume of literature creates challenges in creating and maintaining
these leaderboards. To ease this burden, researchers have developed methods to
extract leaderboard entries from research papers for automated leaderboard
curation. Yet, prior work varies in problem framing, complicating comparisons
and limiting real-world applicability. In this position paper, we present the
first overview of Automatic Leaderboard Generation (ALG) research, identifying
fundamental differences in assumptions, scope, and output formats. We propose
a unified conceptual framework for ALG to standardise how the task is defined.
We offer ALG benchmarking guidelines, including recommendations for datasets
and metrics that promote fair, reproducible evaluation. Lastly, we outline
challenges and new directions for ALG, such as advocating for broader coverage
by including all reported results and richer metadata.
♻ ☆ Inducing Dyslexia in Vision Language Models
Dyslexia, a neurodevelopmental disorder characterized by persistent reading
difficulties, is often linked to reduced activity of the visual word form area
in the ventral occipito-temporal cortex. Traditional approaches to studying
dyslexia, such as behavioral and neuroimaging methods, have provided valuable
insights but remain limited in their ability to test causal hypotheses about
the underlying mechanisms of reading impairments. In this study, we use
large-scale vision-language models (VLMs) to simulate dyslexia by functionally
identifying and perturbing artificial analogues of word processing. Using
stimuli from cognitive neuroscience, we identify visual-word-form-selective
units within VLMs and demonstrate that targeted ablation of these units, unlike
ablation of random units, leads to selective impairments in reading tasks while
general visual and language comprehension abilities remain intact. In
particular, the resulting model matches dyslexic humans' phonological deficits
without a significant change in orthographic processing. Taken together, our
modeling results replicate key characteristics of dyslexia and establish a
computational framework for investigating reading disorders.
♻ ☆ Exploring the Impact of Data Quantity on ASR in Extremely Low-resource Languages
This study investigates the efficacy of data augmentation techniques for
low-resource automatic speech recognition (ASR), focusing on two endangered
Austronesian languages, Amis and Seediq. Recognizing the potential of
self-supervised learning (SSL) in low-resource settings, we explore the impact
of data volume on the continued pre-training of SSL models. We propose a novel
data-selection scheme leveraging a multilingual corpus to augment the limited
target language data. This scheme utilizes a language classifier to extract
utterance embeddings and employs one-class classifiers to identify utterances
phonetically and phonologically proximate to the target languages. Utterances
are ranked and selected based on their decision scores, ensuring the inclusion
of highly relevant data in the SSL-ASR pipeline. Our experimental results
demonstrate the effectiveness of this approach, yielding substantial
improvements in ASR performance for both Amis and Seediq. These findings
underscore the feasibility and promise of data augmentation through
cross-lingual transfer learning for low-resource language ASR.
comment: Accepted to O-COCOSDA 2025
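A minimal sketch of the selection step described above, assuming utterance
embeddings have already been extracted by a language classifier; the One-Class
SVM, the random embeddings, and the top-500 cut-off are illustrative choices,
not the paper's exact configuration.

```python
# Sketch of cross-lingual data selection: fit a one-class classifier on target-
# language utterance embeddings, score a multilingual donor pool, and keep the
# top-ranked utterances. Embeddings here are random placeholders.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
target_emb = rng.normal(loc=1.0, size=(200, 256))    # e.g. Amis / Seediq utterances
donor_emb = rng.normal(loc=0.0, size=(5000, 256))    # multilingual augmentation pool

clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(target_emb)
scores = clf.decision_function(donor_emb)            # higher = closer to target

top_k = 500
selected = np.argsort(scores)[::-1][:top_k]          # indices of selected utterances
print(selected[:10], scores[selected[:10]])
```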
♻ ☆ K-DeCore: Facilitating Knowledge Transfer in Continual Structured Knowledge Reasoning via Knowledge Decoupling
Continual Structured Knowledge Reasoning (CSKR) focuses on training models to
handle sequential tasks, where each task involves translating natural language
questions into structured queries grounded in structured knowledge. Existing
general continual learning approaches face significant challenges when applied
to this task, including poor generalization to heterogeneous structured
knowledge and inefficient reasoning due to parameter growth as tasks increase.
To address these limitations, we propose a novel CSKR framework,
\textsc{K-DeCore}, which operates with a fixed number of tunable parameters.
Unlike prior methods, \textsc{K-DeCore} introduces a knowledge decoupling
mechanism that disentangles the reasoning process into task-specific and
task-agnostic stages, effectively bridging the gaps across diverse tasks.
Building on this foundation, \textsc{K-DeCore} integrates a dual-perspective
memory consolidation mechanism for distinct stages and introduces a
structure-guided pseudo-data synthesis strategy to further enhance the model's
generalization capabilities. Extensive experiments on four benchmark datasets
demonstrate the superiority of \textsc{K-DeCore} over existing continual
learning methods across multiple metrics, leveraging various backbone large
language models.
comment: Accepted at NeurIPS 2025 (poster)
♻ ☆ Aristotle: Mastering Logical Reasoning with A Logic-Complete Decompose-Search-Resolve Framework ACL 2025
Jundong Xu, Hao Fei, Meng Luo, Qian Liu, Liangming Pan, William Yang Wang, Preslav Nakov, Mong-Li Lee, Wynne Hsu
In the context of large language models (LLMs), current advanced reasoning
methods have made impressive strides in various reasoning tasks. However, when
it comes to logical reasoning tasks, major challenges remain in both efficacy
and efficiency. This is rooted in the fact that these systems fail to fully
leverage the inherent structure of logical tasks throughout the reasoning
processes such as decomposition, search, and resolution. To address this, we
propose a logic-complete reasoning framework, Aristotle, with three key
components: Logical Decomposer, Logical Search Router, and Logical Resolver. In
our framework, symbolic expressions and logical rules are comprehensively
integrated into the entire reasoning process, significantly alleviating the
bottlenecks of logical reasoning, i.e., reducing sub-task complexity,
minimizing search errors, and resolving logical contradictions. The
experimental results on several datasets demonstrate that Aristotle
consistently outperforms state-of-the-art reasoning frameworks in both accuracy
and efficiency, particularly excelling in complex logical reasoning scenarios.
We will open-source all our code at https://llm-symbol.github.io/Aristotle/.
comment: Accepted to ACL 2025 (Oral)
♻ ☆ Regularizing Learnable Feature Extraction for Automatic Speech Recognition
Neural front-ends are an appealing alternative to traditional, fixed feature
extraction pipelines for automatic speech recognition (ASR) systems since they
can be directly trained to fit the acoustic model. However, their performance
often falls short compared to classical methods, which we show is largely due
to their increased susceptibility to overfitting. This work therefore
investigates regularization methods for training ASR models with learnable
feature extraction front-ends. First, we examine audio perturbation methods and
show that larger relative improvements can be obtained for learnable features.
Additionally, we identify two limitations in the standard use of SpecAugment
for these front-ends and propose masking in the short-time Fourier transform
(STFT)-domain as a simple but effective modification to address these
challenges. Finally, integrating both regularization approaches effectively
closes the performance gap between traditional and learnable features.
comment: Proceedings of Interspeech 2025
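As a rough illustration of what masking in the STFT domain could look like (the
paper's exact masking policy is not specified here), the sketch below zeroes a
random frequency band and a random time band of a complex spectrogram before
inverting it back to a waveform; band widths and counts are arbitrary choices.

```python
# Illustrative STFT-domain masking for a raw waveform: mask random frequency and
# time bands in the complex spectrogram, then invert back to audio.
import torch

def stft_mask(wav: torch.Tensor, n_fft: int = 512, hop: int = 128,
              freq_width: int = 20, time_width: int = 30) -> torch.Tensor:
    window = torch.hann_window(n_fft)
    spec = torch.stft(wav, n_fft=n_fft, hop_length=hop, window=window,
                      return_complex=True)           # (freq_bins, frames)
    freq_bins, frames = spec.shape
    f0 = torch.randint(0, max(1, freq_bins - freq_width), (1,)).item()
    t0 = torch.randint(0, max(1, frames - time_width), (1,)).item()
    spec[f0:f0 + freq_width, :] = 0                  # frequency-band mask
    spec[:, t0:t0 + time_width] = 0                  # time-band mask
    return torch.istft(spec, n_fft=n_fft, hop_length=hop, window=window,
                       length=wav.shape[-1])

wav = torch.randn(16000)                             # 1 s of 16 kHz audio
print(stft_mask(wav).shape)
```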
♻ ☆ Using Knowledge Graphs to harvest datasets for efficient CLIP model training
Training high-quality CLIP models typically requires enormous datasets, which
limits the development of domain-specific models -- especially in areas that
even the largest CLIP models do not cover well -- and drives up training costs.
This poses challenges for scientific research that needs fine-grained control
over the training procedure of CLIP models. In this work, we show that by
employing smart web search strategies enhanced with knowledge graphs, a robust
CLIP model can be trained from scratch with considerably less data.
Specifically, we demonstrate that an expert foundation model for living
organisms can be built using just 10M images. Moreover, we introduce EntityNet,
a dataset comprising 33M images paired with 46M text descriptions, which
enables the training of a generic CLIP model in significantly reduced time.
comment: Accepted for oral presentation at GCPR 2025 (German Conference on
Pattern Recognition). This is the version submitted to the conference, not
the official conference proceedings
♻ ☆ Experience-Guided Reflective Co-Evolution of Prompts and Heuristics for Automatic Algorithm Design
Combinatorial optimization problems are traditionally tackled with
handcrafted heuristic algorithms, which demand extensive domain expertise and
significant implementation effort. Recent progress has highlighted the
potential of automatic heuristics design powered by large language models
(LLMs), enabling the automatic generation and refinement of heuristics. These
approaches typically maintain a population of heuristics and employ LLMs as
mutation operators to evolve them across generations. While effective, such
methods often risk stagnating in local optima. To address this issue, we
propose Experience-Guided Reflective Co-Evolution of Prompts and Heuristics
(EvoPH) for automatic algorithm design, a novel framework that integrates an
island migration model with an elite selection algorithm to simulate diverse
heuristic populations. In EvoPH, prompts are co-evolved with heuristic
algorithms, guided by performance feedback. We evaluate our framework on two
problems, i.e., Traveling Salesman Problem and Bin Packing Problem.
Experimental results demonstrate that EvoPH achieves the lowest relative error
against optimal solutions across both datasets, advancing the field of
automatic algorithm design with LLMs.
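The island-model loop described above can be sketched roughly as follows;
`llm_propose_heuristic`, `evaluate`, and all population, migration, and scoring
details are simplified placeholders, not EvoPH's actual components.

```python
# Rough sketch of an island-model evolutionary loop with elite selection, where
# an LLM acts as the mutation operator. The two helper functions are stubs.
import random

def llm_propose_heuristic(prompt: str, parent: str) -> str:
    return parent + "+tweak"                 # placeholder for an LLM mutation call

def evaluate(heuristic: str) -> float:
    return -abs(len(heuristic) - 30)         # placeholder fitness (higher is better)

islands = [[f"h{i}_{j}" for j in range(4)] for i in range(3)]
prompt = "Improve this TSP heuristic."

for generation in range(5):
    for isl in islands:
        children = [llm_propose_heuristic(prompt, random.choice(isl)) for _ in range(4)]
        isl[:] = sorted(isl + children, key=evaluate, reverse=True)[:4]  # elites
    # migration: the best individual of each island moves to the next island
    best = [max(isl, key=evaluate) for isl in islands]
    for i, isl in enumerate(islands):
        isl[-1] = best[(i - 1) % len(islands)]

print(max((h for isl in islands for h in isl), key=evaluate))
```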
♻ ☆ SAFE-SQL: Self-Augmented In-Context Learning with Fine-grained Example Selection for Text-to-SQL EMNLP 2025
Text-to-SQL aims to convert natural language questions into executable SQL
queries. While previous approaches, such as skeleton-masked selection, have
demonstrated strong performance by retrieving similar training examples to
guide large language models (LLMs), they struggle in real-world scenarios where
such examples are unavailable. To overcome this limitation, we propose
Self-Augmentation in-context learning with Fine-grained Example selection for
Text-to-SQL (SAFE-SQL), a novel framework that improves SQL generation by
generating and filtering self-augmented examples. SAFE-SQL first prompts an LLM
to generate multiple Text-to-SQL examples relevant to the test input. Then
SAFE-SQL filters these examples through three relevance assessments,
constructing high-quality in-context learning examples. Using self-generated
examples, SAFE-SQL surpasses previous zero-shot and few-shot Text-to-SQL
frameworks, achieving higher execution accuracy. Notably, our approach provides
additional performance gains in extra-hard and unseen scenarios, where
conventional methods often fail.
comment: EMNLP 2025 (Main Conference)
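The generate-then-filter loop can be sketched as follows; the LLM stub, the
prompt wording, the parser, and the single overlap-based relevance score are
hypothetical stand-ins for SAFE-SQL's actual components and its three
assessments.

```python
# Sketch of self-augmented in-context example construction: generate candidate
# question/SQL pairs with an LLM, score their relevance, and keep the top-k.
def call_llm(prompt: str) -> str:
    # Stub standing in for a real LLM call; returns two canned examples.
    return ("Q: How many singers are there? | SQL: SELECT COUNT(*) FROM singer\n"
            "Q: List singer names | SQL: SELECT name FROM singer")

def parse_pairs(raw: str):
    pairs = []
    for line in raw.strip().splitlines():
        q, sql = line.split(" | SQL: ")
        pairs.append((q.removeprefix("Q: "), sql))
    return pairs

def relevance_score(question: str, example) -> float:
    # Placeholder for SAFE-SQL's three relevance assessments collapsed into one
    # number; here: crude token overlap with the test question.
    q_tokens = set(question.lower().split())
    ex_tokens = set(example[0].lower().split())
    return len(q_tokens & ex_tokens) / max(1, len(q_tokens | ex_tokens))

def build_in_context_examples(question: str, schema: str, k: int = 2):
    prompt = f"Schema:\n{schema}\nGenerate question/SQL pairs similar to: {question}"
    candidates = parse_pairs(call_llm(prompt))
    return sorted(candidates, key=lambda ex: relevance_score(question, ex),
                  reverse=True)[:k]

print(build_in_context_examples("How many singers are French?", "singer(name, country)"))
```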
♻ ☆ Speculating LLMs' Chinese Training Data Pollution from Their Tokens
Qingjie Zhang, Di Wang, Haoting Qian, Liu Yan, Tianwei Zhang, Ke Xu, Qi Li, Minlie Huang, Hewu Li, Han Qiu
Tokens are basic elements in the datasets for LLM training. It is well-known
that many tokens representing Chinese phrases in the vocabulary of GPT
(4o/4o-mini/o1/o3/4.5/4.1/o4-mini) indicate content such as pornography or
online gambling. Based on this observation, our goal is to locate Polluted
Chinese (PoC) tokens in LLMs and study the relationship between PoC tokens'
existence and training data. (1) We give a formal definition and taxonomy of
PoC tokens based on the GPT's vocabulary. (2) We build a PoC token detector via
fine-tuning an LLM to label PoC tokens in vocabularies by considering each
token's both semantics and related contents from the search engines. (3) We
study the speculation on the training data pollution via PoC tokens'
appearances (token IDs). Experiments on GPT and 23 other LLMs indicate that PoC
tokens exist widely, while GPT's vocabulary behaves the worst: more than 23% of
long Chinese tokens (i.e., tokens with more than two Chinese characters) are
either porn or online gambling. We validate the accuracy of our speculation
method on famous pre-training datasets like C4 and Pile. Then, considering
GPT-4o, we speculate that the ratio of "Yui Hatano" related webpages in
GPT-4o's training data is around 0.5%.
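The "long Chinese token" statistic is mechanical once a vocabulary is at hand;
the sketch below counts tokens containing more than two CJK characters over a
toy token list (a real analysis would iterate over the actual tokenizer
vocabulary, which is not reproduced here).

```python
# Count "long Chinese tokens" (more than two CJK characters) in a vocabulary.
# The toy token list is a placeholder for a real tokenizer vocabulary.
def is_cjk(ch: str) -> bool:
    return "\u4e00" <= ch <= "\u9fff"            # basic CJK Unified Ideographs block

def is_long_chinese_token(token: str) -> bool:
    return sum(is_cjk(c) for c in token) > 2

vocab = ["hello", "世界", "天气预报", "在线博彩网站", " the"]
long_cn = [t for t in vocab if is_long_chinese_token(t)]
print(len(long_cn), long_cn)                     # 2 ['天气预报', '在线博彩网站']
```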
♻ ☆ Object Detection with Multimodal Large Vision-Language Models: An In-depth Review
The fusion of language and vision in large vision-language models (LVLMs) has
revolutionized deep learning-based object detection by enhancing adaptability,
contextual reasoning, and generalization beyond traditional architectures. This
in-depth review presents a structured exploration of the state-of-the-art in
LVLMs, systematically organized through a three-step research review process.
First, we discuss the functioning of vision language models (VLMs) for object
detection, describing how these models harness natural language processing
(NLP) and computer vision (CV) techniques to revolutionize object detection and
localization. We then explain the architectural innovations, training
paradigms, and output flexibility of recent LVLMs for object detection,
highlighting how they achieve advanced contextual understanding for object
detection. The review thoroughly examines the approaches used in integration of
visual and textual information, demonstrating the progress made in object
detection using VLMs that facilitate more sophisticated object detection and
localization strategies. This review presents comprehensive visualizations
demonstrating LVLMs' effectiveness in diverse scenarios including localization
and segmentation, and then compares their real-time performance, adaptability,
and complexity to traditional deep learning systems. Based on the review, it is
expected that LVLMs will soon meet or surpass the performance of conventional
methods in object detection. The review also identifies a few major limitations
of current LVLM models, proposes solutions to address those challenges, and
presents a clear roadmap for future advancement in this field. We conclude,
based on this study, that recent advancements in LVLMs have made and will
continue to make a transformative impact on object detection and robotic
applications.
comment: First Peer Reviewed Review Paper for Object Detection with
Vision-Language Models (VLMs)
♻ ☆ Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training
Song Lai, Haohan Zhao, Rong Feng, Changyi Ma, Wenzhuo Liu, Hongbo Zhao, Xi Lin, Dong Yi, Min Xie, Qingfu Zhang, Hongbin Liu, Gaofeng Meng, Fei Zhu
Continual post-training (CPT) is a popular and effective technique for
adapting foundation models like multimodal large language models to specific
and ever-evolving downstream tasks. While existing research has primarily
concentrated on methods like data replay, model expansion, or parameter
regularization, the fundamental role of the learning paradigm within CPT
remains largely unexplored. This paper presents a comparative analysis of two
core post-training paradigms: supervised fine-tuning (SFT) and reinforcement
fine-tuning (RFT), investigating their respective impacts on knowledge
retention during CPT. Our experiments are conducted on a benchmark comprising
seven diverse multimodal tasks, utilizing Qwen2.5-VL-7B-Instruct as the base
model for continual post-training. The investigation yields two significant
findings: (1) When continuously learning on downstream tasks, SFT leads to
catastrophic forgetting of previously learned tasks. In contrast, RFT
inherently preserves prior knowledge and achieves performance comparable to
multi-task training. (2) RFT successfully protects and even enhances the
model's general knowledge on standard benchmarks (e.g., MMMU and MMLU-Pro).
Conversely, SFT degrades general model capabilities severely. Further analysis
reveals that this stability is not primarily due to explicit mechanisms like KL
penalty or chain-of-thought reasoning. Instead, we identify an implicit
regularization mechanism inherent to RFT as a key contributing factor. Our
theoretical analysis suggests that RFT's gradient updates are naturally scaled
by the reward variance, acting as a data-dependent regularizer that inherently
protects previously acquired knowledge. Finally, we propose a rollout-based
instance filtering algorithm to enhance the stability and efficiency of RFT.
Our comprehensive study demonstrates the superiority of RFT as a robust
paradigm for continual post-training.
♻ ☆ GraphSearch: An Agentic Deep Searching Workflow for Graph Retrieval-Augmented Generation
Cehao Yang, Xiaojun Wu, Xueyuan Lin, Chengjin Xu, Xuhui Jiang, Yuanliang Sun, Jia Li, Hui Xiong, Jian Guo
Graph Retrieval-Augmented Generation (GraphRAG) enhances factual reasoning in
LLMs by structurally modeling knowledge through graph-based representations.
However, existing GraphRAG approaches face two core limitations: shallow
retrieval that fails to surface all critical evidence, and inefficient
utilization of pre-constructed structural graph data, which hinders effective
reasoning from complex queries. To address these challenges, we propose
\textsc{GraphSearch}, a novel agentic deep searching workflow with dual-channel
retrieval for GraphRAG. \textsc{GraphSearch} organizes the retrieval process
into a modular framework comprising six modules, enabling multi-turn
interactions and iterative reasoning. Furthermore, \textsc{GraphSearch} adopts
a dual-channel retrieval strategy that issues semantic queries over chunk-based
text data and relational queries over structural graph data, enabling
comprehensive utilization of both modalities and their complementary strengths.
Experimental results across six multi-hop RAG benchmarks demonstrate that
\textsc{GraphSearch} consistently improves answer accuracy and generation
quality over traditional retrieval strategies, confirming \textsc{GraphSearch} as a
promising direction for advancing graph retrieval-augmented generation.
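The dual-channel idea can be sketched as follows: one channel does semantic
retrieval over text chunks, the other issues a relational query over the graph,
and the evidence is merged. The toy data structures, placeholder encoder, and
scoring are illustrative assumptions, not GraphSearch's actual modules.

```python
# Rough sketch of dual-channel retrieval for graph RAG.
import numpy as np

chunks = {"c1": "Marie Curie won two Nobel Prizes.",
          "c2": "Pierre Curie was a French physicist."}
chunk_vecs = {cid: np.random.default_rng(len(cid)).normal(size=32) for cid in chunks}
graph = {("Marie Curie", "spouse"): "Pierre Curie",
         ("Pierre Curie", "field"): "physics"}

def embed(text: str) -> np.ndarray:               # placeholder text encoder
    return np.random.default_rng(abs(hash(text)) % 2**32).normal(size=32)

def semantic_channel(query: str, k: int = 1):
    qv = embed(query)
    ranked = sorted(chunks, key=lambda cid: -float(qv @ chunk_vecs[cid]))
    return [chunks[cid] for cid in ranked[:k]]

def relational_channel(entity: str):
    return [f"{h} --{r}--> {t}" for (h, r), t in graph.items() if h == entity]

evidence = (semantic_channel("Who was Marie Curie's spouse?") +
            relational_channel("Marie Curie"))
print(evidence)
```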
♻ ☆ FlowRL: Matching Reward Distributions for LLM Reasoning
Xuekai Zhu, Daixuan Cheng, Dinghuai Zhang, Hengli Li, Kaiyan Zhang, Che Jiang, Youbang Sun, Ermo Hua, Yuxin Zuo, Xingtai Lv, Qizheng Zhang, Lin Chen, Fanghao Shao, Bo Xue, Yunchong Song, Zhenjie Yang, Ganqu Cui, Ning Ding, Jianfeng Gao, Xiaodong Liu, Bowen Zhou, Hongyuan Mei, Zhouhan Lin
We propose FlowRL: matching the full reward distribution via flow balancing
instead of maximizing rewards in large language model (LLM) reinforcement
learning (RL). Recent advanced reasoning models adopt reward-maximizing methods
(\eg, PPO and GRPO), which tend to over-optimize dominant reward signals while
neglecting less frequent but valid reasoning paths, thus reducing diversity. In
contrast, we transform scalar rewards into a normalized target distribution
using a learnable partition function, and then minimize the reverse KL
divergence between the policy and the target distribution. We implement this
idea as a flow-balanced optimization method that promotes diverse exploration
and generalizable reasoning trajectories. We conduct experiments on math and
code reasoning tasks: FlowRL achieves a significant average improvement of
$10.0\%$ over GRPO and $5.1\%$ over PPO on math benchmarks, and performs
consistently better on code reasoning tasks. These results highlight reward
distribution-matching as a key step toward efficient exploration and diverse
reasoning in LLM reinforcement learning.
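One plausible way to write a flow-balance objective consistent with this
description is a trajectory-balance-style squared residual, where a learnable
log-partition value ties the policy's log-probability to the scaled reward. The
shapes, the per-prompt log Z head, and the scaling factor beta below are
assumptions, not FlowRL's exact formulation.

```python
# Hedged sketch of a flow-balance loss: train log Z(x) and the policy so that
# log Z + log pi(y|x) matches reward / beta, pushing pi toward a
# reward-proportional target distribution.
import torch

def flow_balance_loss(logp_y: torch.Tensor,   # (batch,) sum of token log-probs under pi
                      reward: torch.Tensor,   # (batch,) scalar rewards
                      log_z: torch.Tensor,    # (batch,) predicted log-partition per prompt
                      beta: float = 1.0) -> torch.Tensor:
    residual = log_z + logp_y - reward / beta
    return (residual ** 2).mean()

logp_y = torch.randn(8, requires_grad=True)
log_z = torch.zeros(8, requires_grad=True)
reward = torch.rand(8)
loss = flow_balance_loss(logp_y, reward, log_z)
loss.backward()
print(float(loss))
```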
♻ ☆ Dagger Behind Smile: Fool LLMs with a Happy Ending Story EMNLP 2025
The wide adoption of Large Language Models (LLMs) has attracted significant
attention from $\textit{jailbreak}$ attacks, where adversarial prompts crafted
through optimization or manual design exploit LLMs to generate malicious
contents. However, optimization-based attacks have limited efficiency and
transferability, while existing manual designs are either easily detectable or
demand intricate interactions with LLMs. In this paper, we first point out a
novel perspective for jailbreak attacks: LLMs are more responsive to
$\textit{positive}$ prompts. Based on this, we deploy Happy Ending Attack (HEA)
to wrap up a malicious request in a scenario template involving a positive
prompt formed mainly via a $\textit{happy ending}$; this fools LLMs into
jailbreaking either immediately or at a follow-up malicious request. This makes
HEA both efficient and effective, as it requires only up to two turns to fully
jailbreak LLMs. Extensive experiments show that HEA can successfully jailbreak
state-of-the-art LLMs, including GPT-4o, Llama3-70b, Gemini-pro,
and achieves 88.79% attack success rate on average. We also provide
quantitative explanations for the success of HEA.
comment: EMNLP 2025 Findings
♻ ☆ iVISPAR -- An Interactive Visual-Spatial Reasoning Benchmark for VLMs
Vision-Language Models (VLMs) are known to struggle with spatial reasoning
and visual alignment. To help overcome these limitations, we introduce iVISPAR,
an interactive multimodal benchmark designed to evaluate the spatial reasoning
capabilities of VLMs acting as agents. \mbox{iVISPAR} is based on a variant of
the sliding tile puzzle, a classic problem that demands logical planning,
spatial awareness, and multi-step reasoning. The benchmark supports visual 3D,
2D, and text-based input modalities, enabling comprehensive assessments of
VLMs' planning and reasoning skills. We evaluate a broad suite of
state-of-the-art open-source and closed-source VLMs, comparing their
performance while also providing optimal path solutions and a human baseline to
assess the task's complexity and feasibility for humans. Results indicate that
while VLMs perform better on 2D tasks compared to 3D or text-based settings,
they struggle with complex spatial configurations and consistently fall short
of human performance, illustrating the persistent challenge of visual
alignment. This underscores critical gaps in current VLM capabilities,
highlighting their limitations in achieving human-level cognition. Project
website: https://microcosm.ai/ivispar
♻ ☆ ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models
Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Yueting Zhuang
Vision-language models (VLMs) have demonstrated remarkable capabilities in
understanding and reasoning about visual content, but significant challenges
persist in tasks requiring cross-viewpoint understanding and spatial reasoning.
We identify a critical limitation: current VLMs excel primarily at egocentric
spatial reasoning (from the camera's perspective) but fail to generalize to
allocentric viewpoints when required to adopt another entity's spatial frame of
reference. We introduce ViewSpatial-Bench, the first comprehensive benchmark
designed specifically to evaluate multi-viewpoint spatial localization
recognition across five distinct task types, supported by an automated 3D
annotation pipeline that generates precise directional labels. Comprehensive
evaluation of diverse VLMs on ViewSpatial-Bench reveals a significant
performance disparity: models demonstrate reasonable performance on
camera-perspective tasks but exhibit reduced accuracy when reasoning from a
human viewpoint. By fine-tuning VLMs on our multi-perspective spatial dataset,
we achieve an overall performance improvement of 46.24% across tasks,
highlighting the efficacy of our approach. Our work establishes a crucial
benchmark for spatial intelligence in embodied AI systems and provides
empirical evidence that modeling 3D spatial relationships enhances VLMs'
corresponding spatial comprehension capabilities.
comment: Project: https://zju-real.github.io/ViewSpatial-Page/
♻ ☆ Pushing LLMs to Their Logical Reasoning Bound: The Role of Data Reasoning Intensity
Zhen Bi, Zhenlin Hu, Jinnan Yang, Mingyang Chen, Cheng Deng, Yida Xue, Zeyu Yang, Qing Shen, Zhenfang Liu, Kang Zhao, Ningyu Zhang, Jungang Lou
Recent advances in large language models (LLMs) highlight the importance of
training data structure and quality in shaping reasoning behavior. However,
most existing approaches focus on transforming data formats while neglecting
the internal reasoning complexity of training samples, leaving the reasoning
potential of data under-explored and underutilized. In this work, we posit that
LLM logical reasoning performance is jointly constrained by the potential of
the training data and the cognitive capacity of the model. To make this
relationship measurable, we introduce Data Reasoning Intensity (DRI), a novel
metric that quantifies the latent logical reasoning complexity of samples by
decomposing and aggregating their logical structures. This allows us to analyze
how well current LLMs utilize logical reasoning signals and identify
performance gaps relative to data potential. Based on this insight, we
introduce a re-cognizing optimization strategy that systematically enhances the
logical reasoning intensity of training data. Rather than increasing data
volume, our method re-optimizes existing samples to better align with the LLM's
logical reasoning boundary. Extensive experiments show that our approach
significantly improves performance and generalization over data-centric
strategies. We further validate our method under a reinforcement learning
framework. Our results indicate that prioritizing reasoning complexity in data
rather than sheer scale or superficial form is essential to realizing LLMs'
full cognitive potential.
♻ ☆ From Source to Target: Leveraging Transfer Learning for Predictive Process Monitoring in Organizations
Event logs reflect the behavior of business processes that are mapped in
organizational information systems. Predictive process monitoring (PPM)
transforms these data into value by creating process-related predictions that
provide the insights required for proactive interventions at process runtime.
Existing PPM techniques require sufficient amounts of event data or other
relevant resources that might not be readily available, which prevents some
organizations from utilizing PPM. The transfer learning-based PPM technique
presented in this paper allows organizations without suitable event data or
other relevant resources to implement PPM for effective decision support. This
technique is instantiated in both a real-life intra- and an
inter-organizational use case, based on which numerical experiments are
performed using event logs for IT service management processes. The results of
the experiments suggest that knowledge of one business process can be
transferred to a similar business process in the same or a different
organization to enable effective PPM in the target context. The proposed
technique allows organizations to benefit from transfer learning in intra- and
inter-organizational settings by transferring resources such as pre-trained
models within and across organizational boundaries.
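One common way to instantiate such transfer is to reuse a next-activity
prediction model trained on the source log and fine-tune only its output head on
the small target log; the LSTM architecture, the commented-out checkpoint path,
and the toy batches below are illustrative assumptions, not the paper's specific
setup.

```python
# Illustrative transfer for predictive process monitoring: reuse a source-trained
# next-activity model and fine-tune only the classification head on target data.
import torch
import torch.nn as nn

class NextActivityModel(nn.Module):
    def __init__(self, n_activities: int, dim: int = 64):
        super().__init__()
        self.emb = nn.Embedding(n_activities, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, n_activities)

    def forward(self, prefixes):                  # (batch, prefix_len) activity ids
        h, _ = self.lstm(self.emb(prefixes))
        return self.head(h[:, -1])                # predict the next activity

model = NextActivityModel(n_activities=20)
# model.load_state_dict(torch.load("source_org_model.pt"))  # hypothetical source weights

for p in model.emb.parameters():
    p.requires_grad = False                       # freeze transferred layers
for p in model.lstm.parameters():
    p.requires_grad = False

opt = torch.optim.Adam(model.head.parameters(), lr=1e-3)
target_prefixes = torch.randint(0, 20, (32, 10))  # toy target-log batch
target_next = torch.randint(0, 20, (32,))
loss = nn.functional.cross_entropy(model(target_prefixes), target_next)
loss.backward()
opt.step()
print(float(loss))
```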
♻ ☆ PAME-AI: Patient Messaging Creation and Optimization using Agentic AI
Messaging patients is a critical part of healthcare communication, helping to
improve outcomes such as medication adherence and healthy behaviors. However,
traditional mobile message design has significant limitations due to its
inability to explore the high-dimensional design space. We develop PAME-AI, a
novel approach for Patient Messaging Creation and Optimization using Agentic
AI. Built on the Data-Information-Knowledge-Wisdom (DIKW) hierarchy, PAME-AI
offers a structured framework to move from raw data to actionable insights for
high-performance messaging design. PAME-AI is composed of a system of
specialized computational agents that progressively transform raw experimental
data into actionable message design strategies. We demonstrate our approach's
effectiveness through a two-stage experiment comprising 444,691 patient
encounters in Stage 1 and 74,908 in Stage 2. The best-performing generated
message achieved 68.76% engagement compared to the 61.27% baseline,
representing a 12.2% relative improvement in click-through rates. This agentic
architecture enables parallel processing, hypothesis validation, and continuous
learning, making it particularly suitable for large-scale healthcare
communication optimization.
♻ ☆ Answer, Refuse, or Guess? Investigating Risk-Aware Decision Making in Language Models
Language models (LMs) are increasingly used to build agents that can act
autonomously to achieve goals. During this automatic process, agents need to
take a series of actions, some of which might lead to severe consequences if
incorrect actions are taken. Therefore, such agents must sometimes
defer, refusing to act when their confidence is insufficient, to avoid the
potential cost of incorrect actions. Because the severity of consequences
varies across applications, the tendency to defer should also vary: in low-risk
settings agents should answer more freely, while in high-risk settings their
decisions should be more conservative. We study this "answer-or-defer" problem
with an evaluation framework that systematically varies human-specified risk
structures, i.e., rewards and penalties for correct answers, incorrect answers,
and refusals $(r_{\mathrm{cor}}, r_{\mathrm{inc}}, r_{\mathrm{ref}})$, while keeping
tasks fixed. This design evaluates LMs' risk-aware decision policies by
measuring their ability to maximize expected reward. Across multiple datasets
and models, we identify flaws in their decision policies: LMs tend to
over-answer in high-risk settings and over-defer in low-risk settings. After
analyzing the potential cause of such flaws, we find that a simple
skill-decomposition method, which isolates the independent skills required for
answer-or-defer decision making, can consistently improve LMs' decision
policies. Our results highlight the current limitations of LMs in
risk-conditioned decision making and provide practical guidance for deploying
more reliable LM-based agents across applications of varying risk levels.
comment: preprint
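Under this reward structure, the expected-reward-maximizing policy has a simple
closed form: answer whenever the confidence-weighted payoff of answering is at
least the refusal reward. The sketch below is standard decision theory under
that assumption, not a detail specific to the paper; the two risk settings are
illustrative.

```python
# Risk-aware answer-or-defer rule: answer iff the expected reward of answering,
# p * r_cor + (1 - p) * r_inc, is at least the refusal reward r_ref.
def should_answer(p_correct: float, r_cor: float, r_inc: float, r_ref: float) -> bool:
    return p_correct * r_cor + (1.0 - p_correct) * r_inc >= r_ref

# Equivalently, answer when p_correct >= (r_ref - r_inc) / (r_cor - r_inc).
low_risk = dict(r_cor=1.0, r_inc=-1.0, r_ref=0.0)    # threshold p >= 0.5
high_risk = dict(r_cor=1.0, r_inc=-9.0, r_ref=0.0)   # threshold p >= 0.9

for p in (0.6, 0.95):
    print(p, should_answer(p, **low_risk), should_answer(p, **high_risk))
```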
♻ ☆ VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning
Xin Cheng, Yuyue Wang, Xihua Wang, Yihan Wu, Kaisi Guan, Yijing Chen, Peng Zhang, Xiaojiang Liu, Meng Cao, Ruihua Song
Video-conditioned sound and speech generation, encompassing video-to-sound
(V2S) and visual text-to-speech (VisualTTS) tasks, are conventionally addressed
as separate tasks, with limited exploration of unifying them within a single
framework. Recent attempts to unify V2S and VisualTTS face challenges in
handling distinct condition types (e.g., heterogeneous video and transcript
conditions) and require complex training stages. Unifying these two tasks
remains an open problem. To bridge this gap, we present VSSFlow, which
seamlessly integrates both V2S and VisualTTS tasks into a unified flow-matching
framework. VSSFlow uses a novel condition aggregation mechanism to handle
distinct input signals. We find that cross-attention and self-attention layers
exhibit different inductive biases when introducing conditions.
Therefore, VSSFlow leverages these inductive biases to effectively handle
different representations: cross-attention for ambiguous video conditions and
self-attention for more deterministic speech transcripts. Furthermore, contrary
to the prevailing belief that joint training on the two tasks requires complex
training strategies and may degrade performance, we find that VSSFlow benefits
from the end-to-end joint learning process for sound and speech generation
without extra designs on training stages. Detailed analysis attributes it to
the learned general audio prior shared between tasks, which accelerates
convergence, enhances conditional generation, and stabilizes the
classifier-free guidance process. Extensive experiments demonstrate that
VSSFlow surpasses the state-of-the-art domain-specific baselines on both V2S
and VisualTTS benchmarks, underscoring the critical potential of unified
generative models.
comment: Paper Under Review
♻ ☆ The Ever-Evolving Science Exam
Junying Wang, Zicheng Zhang, Yijin Guo, Farong Wen, Ye Shen, Yingji Liang, Yalun Wu, Wenzhe Li, Chunyi Li, Zijian Chen, Qi Jia, Guangtao Zhai
As foundation models grow rapidly in capability and deployment, evaluating
their scientific understanding becomes increasingly critical. Existing science
benchmarks have made progress towards broad Range, wide Reach, and high Rigor,
yet they often face two major challenges: data leakage risks that compromise
benchmarking validity, and evaluation inefficiency due to large-scale testing.
To address these issues, we introduce the Ever-Evolving Science Exam (EESE), a
dynamic benchmark designed to reliably assess scientific capabilities in
foundation models. Our approach consists of two components: 1) a non-public
EESE-Pool with over 100K expertly constructed science instances
(question-answer pairs) across 5 disciplines and 500+ subfields, built through
a multi-stage pipeline ensuring Range, Reach, and Rigor; and 2) a periodically
updated 500-instance subset EESE, sampled and validated to enable
leakage-resilient, low-overhead evaluations. Experiments on 32 open- and
closed-source models demonstrate that EESE effectively differentiates the
strengths and weaknesses of models in scientific fields and cognitive
dimensions. Overall, EESE provides a robust, scalable, and forward-compatible
solution for science benchmark design, offering a realistic measure of how well
foundation models handle science questions. The project page is at:
https://github.com/aiben-ch/EESE.
comment: 33 pages
♻ ☆ Q-Mirror: Unlocking the Multi-Modal Potential of Scientific Text-Only QA Pairs
Junying Wang, Zicheng Zhang, Ye Shen, Yalun Wu, Yingji Liang, Yijin Guo, Farong Wen, Wenzhe Li, Xuezhi Zhao, Qi Jia, Guangtao Zhai
High-quality, multi-modal benchmarks are crucial for advancing scientific
reasoning in large models, yet their manual creation is costly and unscalable.
To address this bottleneck, we explore the potential for transforming Text-Only
QA Pairs (TQAs) into high-quality Multi-Modal QA Pairs (MMQAs), which include
three parts: 1) Task Definition & Evaluation Rubric: We develop a TQA-to-MMQA
framework and establish a comprehensive, multi-dimensional MMQA quality rubric
that provides principles for the transformation. 2) Benchmark Construction:
Then we construct two extensive benchmarks to rigorously evaluate
state-of-the-art generation & understanding models on the distinct tasks of
MMQA generation & MMQA quality evaluation. 3) Preliminary Solution: We develop
an agentic system (Q-Mirror), which operationalizes our framework by
integrating MMQA generation and evaluation into a closed loop for iterative
refinement. Our experiments show that while state-of-the-art models can
generate MMQAs, their outputs still leave substantial gaps, underscoring the
need for reliable evaluation. We further demonstrate that top-tier
understanding models align closely with human judgment in MMQA quality
assessment. Leveraging both insights, the Q-Mirror agent raises average scores
from 78.90 to 85.22 and pass rates from 72% to 95%, offering a practical path
to large-scale scientific benchmarks.
comment: 25 pages
♻ ☆ QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation
Reinforcement learning (RL) has emerged as a central paradigm for training
large language models (LLMs) in reasoning tasks. Yet recent studies question
RL's ability to incentivize reasoning capacity beyond the base model. This
raises a key challenge: how can RL be adapted to solve harder reasoning
problems more effectively? To address this challenge, we propose a simple yet
effective strategy via Question Augmentation: introduce partial solutions
during training to reduce problem difficulty and provide more informative
learning signals. Our method, QuestA, when applied during RL training on math
reasoning tasks, improves not only pass@1 but also pass@k, particularly on
problems where standard RL struggles to make progress. This enables continual
improvement over strong open-source models such as DeepScaleR and OpenMath
Nemotron, further enhancing their reasoning capabilities. We achieve new
state-of-the-art results on math benchmarks using 1.5B-parameter models: 72.50%
(+10.73%) on AIME24, 62.29% (+12.79%) on AIME25, and 41.67% (+10.11%) on
HMMT25. Code, data and model are available at
https://github.com/foreverlasting1202/QuestA.
comment: 19 pages, 8 figures
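Mechanically, question augmentation amounts to prepending a partial solution as
a hint to the original problem during RL rollouts; the sketch below shows that
prompt construction with an illustrative hint ratio and template, not QuestA's
exact recipe.

```python
# Illustrative question augmentation: prepend the first part of a reference
# solution as a hint, reducing problem difficulty during RL training.
def augment_question(question: str, reference_solution: str, hint_ratio: float = 0.5) -> str:
    steps = [s.strip() for s in reference_solution.split("\n") if s.strip()]
    hint = "\n".join(steps[: max(1, int(len(steps) * hint_ratio))])
    return (f"{question}\n\n"
            f"Here is a partial solution you may build on:\n{hint}\n\n"
            f"Continue the reasoning and give the final answer.")

question = "What is the sum of the first 100 positive integers?"
solution = ("Pair terms: 1+100, 2+99, ...\n"
            "There are 50 pairs.\n"
            "Each pair sums to 101.\n"
            "Answer: 5050.")
print(augment_question(question, solution))
```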
♻ ☆ ConfRAG: Confidence-Guided Retrieval-Augmenting Generation
Yin Huang, Yifan Ethan Xu, Kai Sun, Vera Yan, Alicia Sun, Haidar Khan, Jimmy Nguyen, Jingxiang Chen, Mohammad Kachuee, Zhaojiang Lin, Yue Liu, Aaron Colak, Anuj Kumar, Wen-tau Yih, Xin Luna Dong
Can Large Language Models (LLMs) be trained to avoid hallucinating factual
statements, and can Retrieval-Augmented Generation (RAG) be triggered only when
necessary to reduce retrieval and computation costs? In this work, we address
both challenges simultaneously. We introduce ConfQA, a fine-tuning strategy
that reduces hallucination rates from 20-40% to below 5% across multiple
factuality benchmarks. The approach is simple: when the model answers
correctly, it is trained to output the answer; otherwise, it is trained to
respond with "I am unsure". Two design choices make this training effective:
(1) a dampening prompt ("answer only if you are confident") that explicitly
discourages overconfident hallucinations, and (2) training data drawn from
atomic factual statements (e.g., knowledge graph attribute values), which
calibrates model confidence and yields robust generalization across domains and
question types. Building on ConfQA, we propose ConfRAG, a triggering strategy
that invokes RAG only when the model responds with "I am unsure". This framework
achieves accuracy above 95% in the ideal case while reducing unnecessary external
retrievals by over 30%.
comment: 10 pages main content, 7 pages appendix, 6 figures, 10 tables
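The triggering logic itself is simple to sketch: ask the fine-tuned model with
the dampening prompt, and fall back to retrieval only when it answers "I am
unsure". The helper functions below are hypothetical stubs, not the paper's
implementation.

```python
# Sketch of confidence-triggered RAG: retrieve only when the ConfQA-style model
# says it is unsure. `ask_model` and `retrieve_and_answer` are placeholders.
DAMPENING_PROMPT = "Answer only if you are confident; otherwise say 'I am unsure'."

def ask_model(question: str) -> str:
    # Stub for the fine-tuned model; a real system would call the LLM here.
    return "I am unsure"

def retrieve_and_answer(question: str) -> str:
    # Stub for the full RAG pipeline (retrieval + grounded generation).
    return "RAG-grounded answer"

def confident_answer(question: str) -> str:
    draft = ask_model(f"{DAMPENING_PROMPT}\n\nQ: {question}")
    if draft.strip().lower().startswith("i am unsure"):
        return retrieve_and_answer(question)      # trigger RAG only when unsure
    return draft                                  # answer directly, no retrieval

print(confident_answer("What year was the Eiffel Tower completed?"))
```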