Computation and Language 39
♻ ☆ Capturing Visualization Design Rationale IEEE VIS 2025
Prior natural language datasets for data visualization have focused on tasks
such as visualization literacy assessment, insight generation, and
visualization generation from natural language instructions. These studies
often rely on controlled setups with purpose-built visualizations and
artificially constructed questions. As a result, they tend to prioritize the
interpretation of visualizations, focusing on decoding visualizations rather
than understanding their encoding. In this paper, we present a new dataset and
methodology for probing visualization design rationale through natural
language. We leverage a unique source of real-world visualizations and natural
language narratives: literate visualization notebooks created by students as
part of a data visualization course. These notebooks combine visual artifacts
with design exposition, in which students make explicit the rationale behind
their design decisions. We also use large language models (LLMs) to generate
and categorize question-answer-rationale triples from the narratives and
articulations in the notebooks. We then carefully validate the triples and
curate a dataset that captures and distills the visualization design choices
and corresponding rationales of the students.
comment: To be presented at IEEE VIS 2025
♻ ☆ Flow-Modulated Scoring for Semantic-Aware Knowledge Graph Completion
Effective modeling of multifaceted relations is pivotal for Knowledge Graph
Completion (KGC). However, a majority of existing approaches are predicated on
static, embedding-based scoring, exhibiting inherent limitations in capturing
contextual dependencies and relational dynamics. Addressing this gap, we
propose the Flow-Modulated Scoring (FMS) framework. FMS comprises two principal
components: (1) a semantic context learning module that encodes
context-sensitive entity representations, and (2) a conditional flow-matching
module designed to learn the dynamic transformation from a head to a tail
embedding, governed by the aforementioned context. The resultant predictive
vector field, representing the context-informed relational path, serves to
dynamically refine the initial static score of an entity pair. Through this
synergy of context-aware static representations and conditioned dynamic
information, FMS enables deeper modeling of relational semantics.
Comprehensive evaluations on several standard benchmarks demonstrate that our
proposed method surpasses prior state-of-the-art results.
comment: 10 pages
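For intuition, here is a minimal PyTorch sketch of the conditional flow-matching idea; the module shapes, the linear interpolation path, and the cosine-based score refinement are all illustrative assumptions, not the paper's actual design.

    import torch
    import torch.nn as nn

    class ContextFlowField(nn.Module):
        """Predicts a velocity field v(x_tau, tau, ctx); a hypothetical
        stand-in for the paper's conditional flow-matching module."""
        def __init__(self, dim):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 * dim + 1, 4 * dim), nn.SiLU(),
                nn.Linear(4 * dim, dim))

        def forward(self, x_tau, tau, ctx):
            return self.net(torch.cat([x_tau, ctx, tau], dim=-1))

    def flow_matching_loss(field, head, tail, ctx):
        # Linear path x_tau = (1 - tau) * head + tau * tail, whose
        # ground-truth velocity is (tail - head).
        tau = torch.rand(head.size(0), 1)
        x_tau = (1 - tau) * head + tau * tail
        v_pred = field(x_tau, tau, ctx)
        return ((v_pred - (tail - head)) ** 2).mean()

    dim = 16
    field = ContextFlowField(dim)
    head, tail, ctx = torch.randn(8, dim), torch.randn(8, dim), torch.randn(8, dim)
    loss = flow_matching_loss(field, head, tail, ctx)
    loss.backward()
    # One plausible refinement rule: boost a static (TransE-style) score by
    # the field's agreement with the actual head-to-tail displacement.
    static_score = -(head - tail).norm(dim=-1)
    v0 = field(head, torch.zeros(8, 1), ctx)
    refined = static_score + torch.cosine_similarity(v0, tail - head, dim=-1)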
♻ ☆ Large Language Model Confidence Estimation via Black-Box Access
Estimating uncertainty or confidence in the responses of a model can be
significant in evaluating trust not only in the responses, but also in the
model as a whole. In this paper, we explore the problem of estimating
confidence for responses of large language models (LLMs) given only black-box
or query access to them. We propose a simple and extensible framework in which
we engineer novel features and train an interpretable model (viz. logistic
regression) on these features to estimate the confidence. We empirically
demonstrate that our simple framework is effective in estimating the confidence
of Flan-ul2, Llama-13b, Mistral-7b and GPT-4 on four benchmark Q\&A tasks, as
well as of Pegasus-large and BART-large on two benchmark summarization tasks,
surpassing baselines by over $10\%$ (in AUROC) in some cases.
Additionally, our interpretable approach provides insight into features that
are predictive of confidence, leading to the interesting and useful discovery
that our confidence models built for one LLM generalize zero-shot across others
on a given dataset.
comment: Accepted to TMLR 2025
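The abstract does not enumerate the engineered features; the sketch below only illustrates the general recipe with two made-up features (self-consistency across repeated queries and response-length variability) feeding a logistic regression, and assumes nothing about the paper's actual feature set.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def black_box_features(query_fn, prompt, k=5):
        """Query the LLM k times and derive simple features; the specific
        features here are illustrative, not the paper's."""
        responses = [query_fn(prompt) for _ in range(k)]
        counts = {}
        for r in responses:
            counts[r] = counts.get(r, 0) + 1
        agreement = max(counts.values()) / k            # self-consistency
        lengths = np.array([len(r.split()) for r in responses])
        return [agreement, float(lengths.std())]        # length variability

    # Stand-in for a real black-box LLM endpoint.
    rng = np.random.default_rng(0)
    def toy_llm(prompt):
        return rng.choice(["answer A", "answer B", "longer answer C here"])

    prompts = [f"question {i}" for i in range(40)]
    y = rng.integers(0, 2, size=40)                     # toy correctness labels
    X = np.array([black_box_features(toy_llm, p) for p in prompts])
    clf = LogisticRegression().fit(X, y)
    confidence = clf.predict_proba(X)[:, 1]             # estimated P(correct)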
♻ ☆ MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research
Hui Chen, Miao Xiong, Yujie Lu, Wei Han, Ailin Deng, Yufei He, Jiaying Wu, Yibo Li, Yue Liu, Bryan Hooi
Recent advancements in AI agents have demonstrated their growing potential to
drive and support scientific discovery. In this work, we introduce MLR-Bench, a
comprehensive benchmark for evaluating AI agents on open-ended machine learning
research. MLR-Bench includes three key components: (1) 201 research tasks
sourced from NeurIPS, ICLR, and ICML workshops covering diverse ML topics; (2)
MLR-Judge, an automated evaluation framework combining LLM-based reviewers with
carefully designed review rubrics to assess research quality; and (3)
MLR-Agent, a modular agent scaffold capable of completing research tasks
through four stages: idea generation, proposal formulation, experimentation,
and paper writing. Our framework supports both stepwise assessment across these
distinct research stages, and end-to-end evaluation of the final research
paper. We then use MLR-Bench to evaluate six frontier LLMs and an advanced
coding agent, finding that while LLMs are effective at generating coherent
ideas and well-structured papers, current coding agents frequently (e.g., in
80% of the cases) produce fabricated or invalid experimental
results--posing a major barrier to scientific reliability. We validate
MLR-Judge through human evaluation, showing high agreement with expert
reviewers, supporting its potential as a scalable tool for research evaluation.
We open-source MLR-Bench to help the community benchmark, diagnose, and improve
AI research agents toward trustworthy and transparent scientific discovery.
comment: 42 pages, 9 figures
♻ ☆ Intertextual Parallel Detection in Biblical Hebrew: A Transformer-Based Benchmark
Identifying parallel passages in biblical Hebrew (BH) is central to biblical
scholarship for understanding intertextual relationships. Traditional methods
rely on manual comparison, a labor-intensive process prone to human error. This
study evaluates the potential of pre-trained transformer-based language models,
including E5, AlephBERT, MPNet, and LaBSE, for detecting textual parallels in
the Hebrew Bible. Focusing on known parallels between Samuel/Kings and
Chronicles, I assessed each model's capability to generate word embeddings
distinguishing parallel from non-parallel passages. Using cosine similarity and
Wasserstein Distance measures, I found that E5 and AlephBERT show promise; E5
excels in parallel detection, while AlephBERT demonstrates stronger
non-parallel differentiation. These findings indicate that pre-trained models
can enhance the efficiency and accuracy of detecting intertextual parallels in
ancient texts, suggesting broader applications for ancient language studies.
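A pipeline of this shape might be sketched as follows; the placeholder verses and the choice of checkpoint are illustrative assumptions (LaBSE is one of the evaluated encoders and is available on the Hugging Face hub).

    from sentence_transformers import SentenceTransformer, util
    from scipy.stats import wasserstein_distance

    model = SentenceTransformer("sentence-transformers/LaBSE")  # multilingual

    # Toy placeholders; the study uses actual Hebrew passage pairs.
    parallel_pairs = [("verse from Samuel", "matching verse from Chronicles")]
    nonparallel_pairs = [("verse from Samuel", "unrelated verse")]

    def pair_sims(pairs):
        a = model.encode([p[0] for p in pairs], convert_to_tensor=True)
        b = model.encode([p[1] for p in pairs], convert_to_tensor=True)
        return util.cos_sim(a, b).diagonal().cpu().numpy()

    sims_par = pair_sims(parallel_pairs)
    sims_non = pair_sims(nonparallel_pairs)
    # Separation between the two similarity distributions, as in the study.
    print(wasserstein_distance(sims_par, sims_non))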
♻ ☆ Benchmarking the Pedagogical Knowledge of Large Language Models
Maxime Lelièvre, Amy Waldock, Meng Liu, Natalia Valdés Aspillaga, Alasdair Mackintosh, María José Ogando Portela, Jared Lee, Paul Atherton, Robin A. A. Ince, Oliver G. B. Garrod
Benchmarks like Massive Multitask Language Understanding (MMLU) have played a
pivotal role in evaluating AI's knowledge and abilities across diverse domains.
However, existing benchmarks predominantly focus on content knowledge, leaving
a critical gap in assessing models' understanding of pedagogy - the method and
practice of teaching. This paper introduces The Pedagogy Benchmark, a novel
dataset designed to evaluate large language models on their Cross-Domain
Pedagogical Knowledge (CDPK) and Special Education Needs and Disability (SEND)
pedagogical knowledge. These benchmarks are built on a carefully curated set of
questions sourced from professional development exams for teachers, which cover
a range of pedagogical subdomains such as teaching strategies and assessment
methods. Here we outline the methodology and development of these benchmarks.
We report results for 97 models, with accuracies ranging from 28% to
89% on the pedagogical knowledge questions. We consider the relationship
between cost and accuracy and chart the progression of the Pareto value
frontier over time. We provide online leaderboards at
https://rebrand.ly/pedagogy, which are updated with new models and allow
interactive exploration and filtering based on various model properties, such
as cost per token and open-vs-closed weights, as well as performance in
different subjects. LLMs and generative AI have tremendous potential to
influence education and help to address the global learning crisis.
Education-focused benchmarks are crucial to measure models' capacities to
understand pedagogical concepts, respond appropriately to learners' needs, and
support effective teaching practices across diverse contexts. They are needed
for informing the responsible and evidence-based deployment of LLMs and
LLM-based tools in educational settings, and for guiding both development and
policy decisions.
♻ ☆ Text Production and Comprehension by Human and Artificial Intelligence: Interdisciplinary Workshop Report
This report synthesizes the outcomes of a recent interdisciplinary workshop
that brought together leading experts in cognitive psychology, language
learning, and artificial intelligence (AI)-based natural language processing
(NLP). The workshop, funded by the National Science Foundation, aimed to
address a critical knowledge gap in our understanding of the relationship
between AI language models and human cognitive processes in text comprehension
and composition. Through collaborative dialogue across cognitive, linguistic,
and technological perspectives, workshop participants examined the underlying
processes involved when humans produce and comprehend text, and how AI can both
inform our understanding of these processes and augment human capabilities. The
workshop revealed emerging patterns in the relationship between large language
models (LLMs) and human cognition, highlighting both the capabilities of
LLMs and their limitations in fully replicating human-like language
understanding and generation. Key findings include the potential of LLMs to
offer insights into human language processing, the increasing alignment between
LLM behavior and human language processing when models are fine-tuned with
human feedback, and the opportunities and challenges presented by human-AI
collaboration in language tasks. By synthesizing these findings, this report
aims to guide future research, development, and implementation of LLMs in
cognitive psychology, linguistics, and education. It emphasizes the importance
of ethical considerations and responsible use of AI technologies while striving
to enhance human capabilities in text comprehension and production through
effective human-AI collaboration.
♻ ☆ A Study of In-Context-Learning-Based Text-to-SQL Errors
Jiawei Shen, Chengcheng Wan, Ruoyi Qiao, Jiazhen Zou, Hang Xu, Yuchen Shao, Yueling Zhang, Weikai Miao, Geguang Pu
Large language models (LLMs) have been adopted to perform text-to-SQL tasks,
utilizing their in-context learning (ICL) capability to translate natural
language questions into structured query language (SQL). However, such a
technique faces correctness problems and requires efficient repairing
solutions. In this paper, we conduct the first comprehensive study of
text-to-SQL errors. Our study covers four representative ICL-based techniques,
five basic repairing methods, two benchmarks, and two LLM settings. We find
that text-to-SQL errors are widespread and summarize 29 error types across 7
categories. We also find that existing repairing attempts yield limited
correctness improvements at the cost of high computational overhead and many
mis-repairs. Based on these findings, we propose MapleRepair, a novel
text-to-SQL error detection and repairing framework. The evaluation
demonstrates that MapleRepair outperforms existing solutions by repairing 13.8%
more queries with negligible mis-repairs and 67.4% less overhead.
♻ ☆ OM4OV: Leveraging Ontology Matching for Ontology Versioning
Due to the dynamic nature of the Semantic Web, version control is necessary
to capture time-varying information, particularly for widely used ontologies.
Despite the long-standing recognition of ontology versioning (OV) as a crucial
component for efficient ontology management, the growing size of ontologies and
accumulating errors caused by manual labour overwhelm current OV approaches. In
this paper, we propose a fresh approach to performing OV using existing
ontology matching (OM) techniques and systems. We introduce a unified OM4OV
pipeline. From an OM perspective, we formulate a new task definition and
measurements for OV tasks. Building upon the prior alignment(s) from OM, we
propose a pipeline optimisation method called the cross-reference (CR)
mechanism to enhance overall OV performance. We experimentally validate the
OM4OV pipeline and the cross-reference mechanism on an OV testbed originating
from the Ontology Alignment Evaluation Initiative (OAEI) datasets. We also
discuss insights into OM used for OV tasks, where some mappings flagged as
false by OV systems are in fact true.
comment: 15 pages, 8 figures, 1 table
♻ ☆ HyperCLOVA X THINK Technical Report
We introduce HyperCLOVA X THINK, the first reasoning-focused large language
model in the HyperCLOVA X family, pre-trained on roughly $6$ trillion
high-quality Korean and English tokens, augmented with targeted synthetic
Korean data. It was implemented as a compute-memory-balanced Peri-LN
Transformer scaled with $\mu$P, pre-trained through a three-stage curriculum
that expands the context window to $128$K tokens, and post-trained via
supervised fine-tuning and Reinforcement Learning from Verifiable Rewards;
it supports both detailed-rationale and concise-answer modes. It delivers
competitive performance against similarly sized models on Korea-focused
benchmarks such as KMMLU, CSAT, KoBALT-700, HAERAE-1.0, and KoBigBench, while
preserving robust bilingual consistency and translation quality. In addition, a
vision-augmented variant matches or exceeds GPT-4.1 on the KCSAT STEM
benchmark, all of which are achieved with substantially lower training compute
than existing models of similar sizes. We also present a pruning and
distillation technique that will soon be applied to HyperCLOVA X THINK for an
open-source and business-friendly foundation model. Altogether, these
capabilities position HyperCLOVA X THINK as a robust foundation for Korean AI
innovation and a valuable resource for the global research community.
comment: 50 pages, 13 figures; fixed figures in the appendix
♻ ☆ AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models
Kai Li, Can Shen, Yile Liu, Jirui Han, Kelong Zheng, Xuechao Zou, Zhe Wang, Xingjian Du, Shun Zhang, Hanjun Luo, Yingbin Jin, Xinxin Xing, Ziyang Ma, Yue Liu, Xiaojun Jia, Yifan Zhang, Junfeng Fang, Kun Wang, Yibo Yan, Haoyang Li, Yiming Li, Xiaobin Zhuang, Yang Liu, Haibo Hu, Zhizheng Wu, Xiaolin Hu, Eng-Siong Chng, XiaoFeng Wang, Wenyuan Xu, Wei Dong, Xinfeng Li
The rapid advancement and expanding applications of Audio Large Language
Models (ALLMs) demand a rigorous understanding of their trustworthiness.
However, systematic research on evaluating these models, particularly
concerning risks unique to the audio modality, remains scarce.
Existing evaluation frameworks primarily focus on the text modality or address
only a restricted set of safety dimensions, failing to adequately account for
the unique characteristics and application scenarios inherent to the audio
modality. We introduce AudioTrust, the first multifaceted trustworthiness
evaluation framework and benchmark specifically designed for ALLMs. AudioTrust
facilitates assessments across six key dimensions: fairness, hallucination,
safety, privacy, robustness, and authentication. To comprehensively evaluate
these dimensions, AudioTrust is structured around 18 distinct experimental
setups. Its core is a meticulously constructed dataset of over 4,420 audio/text
samples, drawn from real-world scenarios (e.g., daily conversations, emergency
calls, voice assistant interactions), specifically designed to probe the
multifaceted trustworthiness of ALLMs. For assessment, the benchmark carefully
designs 9 audio-specific evaluation metrics, and we employ a large-scale
automated pipeline for objective and scalable scoring of model outputs.
Experimental results reveal the trustworthiness boundaries and limitations of
current state-of-the-art open-source and closed-source ALLMs when confronted
with various high-risk audio scenarios, offering valuable insights for the
secure and trustworthy deployment of future audio models. Our platform and
benchmark are available at https://github.com/JusperLee/AudioTrust.
comment: Technical Report
♻ ☆ Quasi-symbolic Semantic Geometry over Transformer-based Variational AutoEncoder CoNLL2025
Formal/symbolic semantics can provide canonical, rigid controllability and
interpretability to sentence representations due to their \textit{localisation}
or \textit{composition} property. How can we endow current distributional
sentence representations with this property, so as to control and interpret
the generation of language models (LMs)? In this work, we theoretically frame
sentence semantics as the composition of \textit{semantic role - word content}
features and propose a formal semantic geometry. To inject this geometry into
Transformer-based LMs (i.e. GPT2), we deploy a Transformer-based Variational
AutoEncoder with a supervised approach, where sentence generation can be
manipulated and explained over a low-dimensional latent Gaussian space. In
addition, we propose a new probing algorithm to guide the movement of sentence
vectors over such geometry. Experimental results reveal that the formal
semantic geometry can potentially deliver better control and interpretation to
sentence generation.
comment: CoNLL2025 (Best Paper nomination)
♻ ☆ T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, Hongsheng Li
Recent advancements in large language models have demonstrated how
chain-of-thought (CoT) and reinforcement learning (RL) can improve performance.
However, applying such reasoning strategies to the visual generation domain
remains largely unexplored. In this paper, we present T2I-R1, a novel
reasoning-enhanced text-to-image generation model, powered by RL with a
bi-level CoT reasoning process. Specifically, we identify two levels of CoT
that can be utilized to enhance different stages of generation: (1) the
semantic-level CoT for high-level planning of the prompt and (2) the
token-level CoT for low-level pixel processing during patch-by-patch
generation. To better coordinate these two levels of CoT, we introduce
BiCoT-GRPO with an ensemble of generation rewards, which seamlessly optimizes
both generation CoTs within the same training step. By applying our reasoning
strategies to the baseline model, Janus-Pro, we achieve superior performance
with 13% improvement on T2I-CompBench and 19% improvement on the WISE
benchmark, even surpassing the state-of-the-art model FLUX.1. Code is available
at: https://github.com/CaraJ7/T2I-R1
comment: Project Page: https://github.com/CaraJ7/T2I-R1
♻ ☆ Iterative Resolution of Prompt Ambiguities Using a Progressive Cutting-Search Approach
Generative AI systems have revolutionized human interaction by enabling
natural language-based coding and problem solving. However, the inherent
ambiguity of natural language often leads to imprecise instructions, forcing
users to iteratively test, correct, and resubmit their prompts. We propose an
iterative approach that systematically narrows down these ambiguities through a
structured series of clarification questions and alternative solution
proposals, each illustrated with input/output examples. Once every
uncertainty is resolved, a final, precise solution is generated. Evaluated on a
diverse dataset spanning coding, data analysis, and creative writing, our
method demonstrates superior accuracy, competitive resolution times, and higher
user satisfaction compared to conventional one-shot solutions, which typically
require multiple manual iterations to achieve a correct output.
♻ ☆ Not Minds, but Signs: Reframing LLMs through Semiotics
This paper challenges the prevailing tendency to frame Large Language Models
(LLMs) as cognitive systems, arguing instead for a semiotic perspective that
situates these models within the broader dynamics of sign manipulation and
meaning-making. Rather than assuming that LLMs understand language or simulate
human thought, we propose that their primary function is to recombine,
recontextualize, and circulate linguistic forms based on probabilistic
associations. By shifting from a cognitivist to a semiotic framework, we avoid
anthropomorphism and gain a more precise understanding of how LLMs participate
in cultural processes, not by thinking, but by generating texts that invite
interpretation. Through theoretical analysis and practical examples, the paper
demonstrates how LLMs function as semiotic agents whose outputs can be treated
as interpretive acts, open to contextual negotiation and critical reflection.
We explore applications in literature, philosophy, education, and cultural
production, emphasizing how LLMs can serve as tools for creativity, dialogue,
and critical inquiry. The semiotic paradigm foregrounds the situated,
contingent, and socially embedded nature of meaning, offering a more rigorous
and ethically aware framework for studying and using LLMs. Ultimately, this
approach reframes LLMs as technological participants in an ongoing ecology of
signs. They do not possess minds, but they alter how we read, write, and make
meaning, compelling us to reconsider the foundations of language,
interpretation, and the role of artificial systems in the production of
knowledge.
♻ ☆ Positional Bias in Binary Question Answering: How Uncertainty Shapes Model Preferences
Positional bias in binary question answering occurs when a model
systematically favors one choice over another based solely on the ordering of
presented options. In this study, we quantify and analyze positional bias
across five large language models under varying degrees of answer uncertainty.
We re-adapted the SQuAD-it dataset by adding an extra incorrect answer option
and then created multiple versions with progressively less context and more
out-of-context answers, yielding datasets that range from low to high
uncertainty. Additionally, we evaluate two naturally higher-uncertainty
benchmarks: (1) WebGPT - question pairs with unequal human-assigned quality
scores, and (2) Winning Arguments - where models predict the more persuasive
argument in Reddit's r/ChangeMyView exchanges. Across each dataset, the order
of the "correct" (or higher-quality/persuasive) option is systematically
flipped (first placed in position 1, then in position 2) to compute both
Preference Fairness and Position Consistency. We observe that positional bias
is nearly absent under low-uncertainty conditions, but grows exponentially as
it becomes harder to decide which option is correct.
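One simple way to operationalize the flipped-order protocol is sketched below; the two reported quantities are simplified stand-ins and may not match the paper's exact definitions of Preference Fairness and Position Consistency.

    def choose(model_fn, question, opt1, opt2):
        """model_fn returns '1' or '2'; stands in for a real LLM call."""
        return model_fn(f"{question}\n1) {opt1}\n2) {opt2}\nAnswer 1 or 2:")

    def positional_metrics(model_fn, items):
        consistent, first_pos = 0, 0
        for question, correct, wrong in items:
            a = choose(model_fn, question, correct, wrong)  # correct in slot 1
            b = choose(model_fn, question, wrong, correct)  # correct in slot 2
            picked_a = correct if a == "1" else wrong
            picked_b = correct if b == "2" else wrong
            consistent += picked_a == picked_b   # same content either order
            first_pos += (a == "1") + (b == "1") # raw pull toward slot 1
        n = len(items)
        return {"position_consistency": consistent / n,
                "slot1_preference": first_pos / (2 * n)}   # 0.5 = unbiased

    # A toy model that always answers "1" exhibits maximal positional bias.
    print(positional_metrics(lambda prompt: "1",
                             [("Capital of Italy?", "Rome", "Milan")]))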
♻ ☆ Fact Recall, Heuristics or Pure Guesswork? Precise Interpretations of Language Models for Fact Completion ACL
Language models (LMs) can make a correct prediction based on many possible
signals in a prompt, not all corresponding to recall of factual associations.
However, current interpretations of LMs fail to take this into account. For
example, given the query "Astrid Lindgren was born in" with the corresponding
completion "Sweden", no difference is made between whether the prediction was
based on knowing where the author was born or assuming that a person with a
Swedish-sounding name was born in Sweden. In this paper, we present a
model-specific recipe - PrISM - for constructing datasets with examples of four
different prediction scenarios: generic language modeling, guesswork,
heuristics recall and exact fact recall. We apply two popular interpretability
methods to the scenarios: causal tracing (CT) and information flow analysis. We
find that both yield distinct results for each scenario. Results for exact fact
recall and generic language modeling scenarios confirm previous conclusions
about the importance of mid-range MLP sublayers for fact recall, while results
for guesswork and heuristics indicate a critical role of late last token
position MLP sublayers. In summary, we contribute resources for a more
extensive and granular study of fact completion in LMs, together with analyses
that provide a more nuanced understanding of how LMs process fact-related
queries.
comment: accepted to ACL Findings 2025
♻ ☆ Efficient Domain-adaptive Continual Pretraining for the Process Industry in the German Language
Domain-adaptive continual pretraining (DAPT) is a state-of-the-art technique
that further trains a language model (LM) on its pretraining task, e.g., masked
language modeling (MLM), when common domain adaptation via LM fine-tuning is
not possible due to a lack of labeled task data. Although popular, MLM requires
a significant corpus of domain-related data, which is difficult to obtain for
specific domains in languages other than English, such as the process industry
in the German language. This paper introduces an efficient approach called
ICL-augmented pretraining or ICL-APT that leverages in-context learning (ICL)
and k-nearest neighbors (kNN) to augment target data with domain-related and
in-domain texts, significantly reducing GPU time while maintaining strong model
performance. Our results show that the best configuration of ICL-APT performed
better than the state-of-the-art DAPT by 28.7% (7.87 points) and requires
almost 4 times less GPU-computing time, providing a cost-effective solution for
industries with limited computational capacity. The findings highlight the
broader applicability of this framework to other low-resource industries,
making NLP-based solutions more accessible and feasible in production
environments.
comment: accepted to TSD 2025
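The kNN augmentation step could look roughly like this; TF-IDF stands in for the sentence embeddings one would use in practice, and the example texts and the value of k are made up.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import NearestNeighbors

    target_docs = ["Pumpe P-101 zeigt Kavitation", "Ventil V-3 klemmt"]
    candidate_pool = ["Kavitation entsteht bei zu geringem Zulaufdruck",
                      "Wartungsplan fuer Kreiselpumpen",
                      "Rezept fuer Apfelkuchen"]

    # Embed everything in a shared space (TF-IDF as a lightweight stand-in).
    vec = TfidfVectorizer().fit(target_docs + candidate_pool)
    nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(
        vec.transform(candidate_pool))

    # For each target document, pull its k nearest domain texts into the
    # continual-pretraining corpus.
    _, idx = nn.kneighbors(vec.transform(target_docs))
    augmented_corpus = target_docs + sorted(
        {candidate_pool[j] for row in idx for j in row})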
♻ ☆ Integrating Expert Labels into LLM-based Emission Goal Detection: Example Selection vs Automatic Prompt Design
We address the detection of emission reduction goals in corporate reports, an
important task for monitoring companies' progress in addressing climate change.
Specifically, we focus on the issue of integrating expert feedback in the form
of labeled example passages into LLM-based pipelines, and compare the two
strategies of (1) a dynamic selection of few-shot examples and (2) the
automatic optimization of the prompt by the LLM itself. Our findings on a
public dataset of 769 climate-related passages from real-world business reports
indicate that automatic prompt optimization is the superior approach, while
combining both methods provides only limited benefit. Qualitative results
indicate that optimized prompts do indeed capture many intricacies of the
targeted emission goal extraction task.
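A dynamic few-shot selection baseline of the kind compared here can be sketched as follows; the embedding choice, the example passages, and the prompt template are assumptions.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    labeled = [("We aim to cut Scope 1 emissions 40% by 2030.", "goal"),
               ("Our revenue grew by 12% in 2023.", "no_goal"),
               ("Net-zero operations are targeted for 2045.", "goal")]

    vec = TfidfVectorizer().fit([t for t, _ in labeled])

    def build_prompt(passage, k=2):
        # Pick the k labeled expert examples most similar to the input.
        sims = cosine_similarity(vec.transform([passage]),
                                 vec.transform([t for t, _ in labeled]))[0]
        shots = sorted(zip(sims, labeled), key=lambda s: -s[0])[:k]
        demo = "\n".join(f"Passage: {t}\nLabel: {y}" for _, (t, y) in shots)
        return f"{demo}\nPassage: {passage}\nLabel:"

    print(build_prompt("We target a 30% reduction in CO2 by 2035."))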
♻ ☆ DiReCT: Diagnostic Reasoning for Clinical Notes via Large Language Models NeurIPS 2024
Bowen Wang, Jiuyang Chang, Yiming Qian, Guoxin Chen, Junhao Chen, Zhouqiang Jiang, Jiahao Zhang, Yuta Nakashima, Hajime Nagahara
Large language models (LLMs) have recently showcased remarkable capabilities,
spanning a wide range of tasks and applications, including those in the medical
domain. Models like GPT-4 excel in medical question answering but may lack
interpretability when handling complex tasks in real clinical settings. We
thus introduce the diagnostic reasoning dataset for
clinical notes (DiReCT), aiming at evaluating the reasoning ability and
interpretability of LLMs compared to human doctors. It contains 511 clinical
notes, each meticulously annotated by physicians, detailing the diagnostic
reasoning process from observations in a clinical note to the final diagnosis.
Additionally, a diagnostic knowledge graph is provided to offer essential
knowledge for reasoning, which may not be covered in the training data of
existing LLMs. Evaluations of leading LLMs on DiReCT reveal a significant
gap between their reasoning ability and that of human doctors, highlighting the
critical need for models that can reason effectively in real-world clinical
scenarios.
comment: Accepted by NeurIPS 2024 D&B Track
♻ ☆ An evaluation of LLMs and Google Translate for translation of selected Indian languages via sentiment and semantic analyses
Large Language models (LLMs) have been prominent for language translation,
including low-resource languages. There has been limited study on the
assessment of the quality of translations generated by LLMs, including Gemini,
GPT, and Google Translate. This study addresses this limitation by applying
semantic and sentiment analyses to selected LLMs for Indian languages,
including Sanskrit, Telugu and Hindi. We select prominent texts (the Bhagavad
Gita, Tamas and Maha Prasthanam) that have been well translated by experts and
use LLMs to generate their translations into English, and provide a comparison
with
selected expert (human) translations. Our investigation revealed that while
LLMs have made significant progress in translation accuracy, challenges remain
in preserving sentiment and semantic integrity, especially in metaphorical and
philosophical contexts for texts such as the Bhagavad Gita. The sentiment
analysis revealed that GPT models are better at preserving sentiment polarity
for the given texts than the human (expert) translations, and the results show
that GPT models are generally better than Google Translate at maintaining
sentiment and semantics. This study could
help in the development of accurate and culturally sensitive translation
systems for large language models.
♻ ☆ SAGE: Steering Dialog Generation with Future-Aware State-Action Augmentation
Recent advances in large language models have demonstrated impressive
capabilities in task-oriented applications, yet building emotionally
intelligent chatbots that can engage in natural, strategic conversations
remains a challenge. We present a novel approach called SAGE that uses latent
variables to control long-horizon behavior in dialogue generation. At the core
of our method is the State-Action Chain (SAC), which augments standard language
model fine-tuning by introducing latent variables that encapsulate emotional
states and conversational strategies between dialogue turns. During inference,
these variables are generated before each response, enabling coarse-grained
control over dialogue progression while maintaining natural interaction
patterns. We also introduce a self-improvement pipeline that leverages dialogue
tree search, LLM-based reward modeling, and targeted fine-tuning to optimize
conversational trajectories. Our experimental results show that models trained
with this approach demonstrate improved performance in emotional intelligence
metrics while maintaining strong capabilities on LLM benchmarks. The discrete
nature of our latent variables facilitates search-based strategies and provides
a foundation for future applications of reinforcement learning to dialogue
systems, where learning can occur at the state level rather than the token
level. https://github.com/apple/ml-sage-dialog-gen
comment: 9 pages main text
♻ ☆ Learning-to-Context Slope: Evaluating In-Context Learning Effectiveness Beyond Performance Illusions
In-context learning (ICL) has emerged as an effective approach to enhance the
performance of large language models (LLMs). However, its effectiveness varies
significantly across models and tasks, posing challenges for practitioners to
determine when ICL reliably improves performance. Current evaluation
approaches, reliant on performance change after applying ICL, suffer from low
reliability, poor attribution, and impracticality in data-insufficient
scenarios. We propose the Learning-to-Context Slope (LCS), a novel metric that
quantifies ICL effectiveness by modeling the slope between learning gain (loss
decrease from demonstrations) and contextual relevance (demonstration-input
relevance). LCS addresses key limitations of performance-based metrics: (1) it
captures continuous loss changes even when outputs are incorrect, improving
reliability; (2) its formulation attributes ICL failures to weak contextual
alignment (inability to adapt inputs to demonstrations) or strong output
calibration (self-verification of correctness); and (3) it minimizes reliance
on labeled data via synthetic evaluation. Extensive experiments demonstrate
that LCS strongly correlates with performance improvements in labeled settings
and reliably reflects true effectiveness in biased or data-scarce scenarios.
Further analysis reveals actionable thresholds for LCS and identifies model
capabilities critical to ICL success.
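Taking the abstract's definition at face value, LCS is the fitted slope of learning gain against contextual relevance; a schematic computation on toy numbers (the exact operationalization of both axes is an assumption) is:

    import numpy as np

    def lcs(loss_no_demo, loss_with_demo, relevance):
        """Learning-to-Context Slope: regress learning gain (loss decrease
        from demonstrations) on demonstration-input relevance and return
        the slope."""
        gain = np.asarray(loss_no_demo) - np.asarray(loss_with_demo)
        slope, _intercept = np.polyfit(np.asarray(relevance), gain, deg=1)
        return slope

    # Toy data: higher demonstration relevance -> larger loss decrease,
    # which yields a positive slope (ICL is helping).
    relevance      = [0.1, 0.3, 0.5, 0.7, 0.9]
    loss_no_demo   = [2.0, 2.1, 1.9, 2.0, 2.2]
    loss_with_demo = [1.9, 1.8, 1.5, 1.3, 1.2]
    print(lcs(loss_no_demo, loss_with_demo, relevance))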
♻ ☆ ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition
Yujie Liu, Zonglin Yang, Tong Xie, Jinjie Ni, Ben Gao, Yuqiang Li, Shixiang Tang, Wanli Ouyang, Erik Cambria, Dongzhan Zhou
Large language models (LLMs) have demonstrated potential in assisting
scientific research, yet their ability to discover high-quality research
hypotheses remains unexamined due to the lack of a dedicated benchmark. To
address this gap, we introduce the first large-scale benchmark for evaluating
LLMs with a near-sufficient set of sub-tasks of scientific discovery:
inspiration retrieval, hypothesis composition, and hypothesis ranking. We
develop an automated framework that extracts critical components - research
questions, background surveys, inspirations, and hypotheses - from scientific
papers across 12 disciplines, with expert validation confirming its accuracy.
To prevent data contamination, we focus exclusively on papers published in
2024, ensuring minimal overlap with LLM pretraining data. Our evaluation
reveals that LLMs perform well in retrieving inspirations, an
out-of-distribution task, suggesting their ability to surface novel knowledge
associations. This positions LLMs as "research hypothesis mines", capable of
facilitating automated scientific discovery by generating innovative hypotheses
at scale with minimal human intervention.
♻ ☆ ComRAG: Retrieval-Augmented Generation with Dynamic Vector Stores for Real-time Community Question Answering in Industry ACL 2025
Community Question Answering (CQA) platforms can be deemed important
knowledge bases in a community, but effectively leveraging historical
interactions and domain knowledge in real time remains a challenge. Existing
methods often underutilize external knowledge, fail to incorporate dynamic
historical QA context, or lack memory mechanisms suited for industrial
deployment. We propose ComRAG, a retrieval-augmented generation framework for
real-time industrial CQA that integrates static knowledge with dynamic
historical QA pairs via a centroid-based memory mechanism designed for
retrieval, generation, and efficient storage. Evaluated on three industrial CQA
datasets, ComRAG consistently outperforms all baselines--achieving up to 25.9%
improvement in vector similarity, reducing latency by 8.7% to 23.3%, and
lowering chunk growth from 20.23% to 2.06% over iterations.
comment: 7 pages, 4 figures. Accepted at ACL 2025 Industry Track
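A centroid-based memory of the kind described might, in its simplest form, look like the following; the similarity threshold and the running-mean update rule are assumptions. Storage grows with the number of clusters rather than the number of QA pairs, which is what keeps chunk growth low.

    import numpy as np

    class CentroidMemory:
        """Store historical QA pairs under running-mean centroids."""
        def __init__(self, threshold=0.8):
            self.threshold = threshold
            self.centroids, self.counts, self.payloads = [], [], []

        def _nearest(self, v):
            if not self.centroids:
                return None, -1.0
            sims = [v @ c / (np.linalg.norm(v) * np.linalg.norm(c))
                    for c in self.centroids]
            i = int(np.argmax(sims))
            return i, sims[i]

        def add(self, embedding, qa_pair):
            i, sim = self._nearest(embedding)
            if sim >= self.threshold:      # merge into existing cluster
                n = self.counts[i]
                self.centroids[i] = (self.centroids[i] * n + embedding) / (n + 1)
                self.counts[i] += 1
                self.payloads[i].append(qa_pair)
            else:                          # open a new cluster
                self.centroids.append(embedding.astype(float))
                self.counts.append(1)
                self.payloads.append([qa_pair])

        def retrieve(self, embedding):
            i, _ = self._nearest(embedding)
            return self.payloads[i] if i is not None else []

    mem = CentroidMemory()
    mem.add(np.array([1.0, 0.0]), ("How to reset device?", "Hold power 10s."))
    print(mem.retrieve(np.array([0.9, 0.1])))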
♻ ☆ Revisiting Epistemic Markers in Confidence Estimation: Can Markers Accurately Reflect Large Language Models' Uncertainty? ACL2025
As large language models (LLMs) are increasingly used in high-stakes domains,
accurately assessing their confidence is crucial. Humans typically express
confidence through epistemic markers (e.g., "fairly confident") instead of
numerical values. However, it remains unclear whether LLMs consistently use
these markers to reflect their intrinsic confidence due to the difficulty of
quantifying uncertainty associated with various markers. To address this gap,
we first define marker confidence as the observed accuracy when a model employs
an epistemic marker. We evaluate its stability across multiple
question-answering datasets in both in-distribution and out-of-distribution
settings for open-source and proprietary LLMs. Our results show that while
markers generalize well within the same distribution, their confidence is
inconsistent in out-of-distribution scenarios. These findings raise significant
concerns about the reliability of epistemic markers for confidence estimation,
underscoring the need for improved alignment between marker-based confidence
and actual model uncertainty. Our code is available at
https://github.com/HKUST-KnowComp/MarCon.
comment: ACL2025 Main
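Marker confidence as defined above is simply per-marker observed accuracy; a minimal computation on toy records:

    from collections import defaultdict

    # Each record: (epistemic marker used in the answer, answer was correct?)
    records = [("fairly confident", True), ("fairly confident", True),
               ("fairly confident", False), ("not sure", False),
               ("not sure", True)]

    totals = defaultdict(lambda: [0, 0])        # marker -> [correct, seen]
    for marker, correct in records:
        totals[marker][0] += correct
        totals[marker][1] += 1

    marker_confidence = {m: c / n for m, (c, n) in totals.items()}
    print(marker_confidence)  # e.g., {'fairly confident': 0.67, 'not sure': 0.5}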
♻ ☆ RadZero: Similarity-Based Cross-Attention for Explainable Vision-Language Alignment in Radiology with Zero-Shot Multi-Task Capability
Recent advancements in multi-modal models have significantly improved
vision-language (VL) alignment in radiology. However, existing approaches
struggle to effectively utilize complex radiology reports for learning and
offer limited interpretability through attention probability visualizations. To
address these challenges, we introduce RadZero, a novel framework for VL
alignment in radiology with zero-shot multi-task capability. A key component of
our approach is VL-CABS (Vision-Language Cross-Attention Based on Similarity),
which aligns text embeddings with local image features for interpretable,
fine-grained VL reasoning. RadZero leverages large language models to extract
concise semantic sentences from radiology reports and employs multi-positive
contrastive training to effectively capture relationships between images and
multiple relevant textual descriptions. It uses a pre-trained vision encoder
with additional trainable Transformer layers, allowing efficient
high-resolution image processing. By computing similarity between text
embeddings and local image patch features, VL-CABS enables zero-shot inference
with similarity probability for classification, and pixel-level VL similarity
maps for grounding and segmentation. Experimental results on public chest
radiograph benchmarks show that RadZero outperforms state-of-the-art methods in
zero-shot classification, grounding, and segmentation. Furthermore, VL
similarity map analysis highlights the potential of VL-CABS for improving
explainability in VL alignment. Additionally, qualitative evaluation
demonstrates RadZero's capability for open-vocabulary semantic segmentation,
further validating its effectiveness in medical imaging.
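The similarity-map idea behind VL-CABS can be sketched in a few lines; the feature shapes and the max-pooling/sigmoid readout are assumptions for illustration, not the paper's exact heads.

    import torch
    import torch.nn.functional as F

    # Hypothetical encoder outputs: a 14x14 grid of local patch features and
    # one text embedding for a report sentence, both of dimension 512.
    patch_feats = torch.randn(14 * 14, 512)
    text_emb = torch.randn(512)

    patch_n = F.normalize(patch_feats, dim=-1)
    text_n = F.normalize(text_emb, dim=0)
    sim = patch_n @ text_n                  # cosine similarity per patch

    sim_map = sim.reshape(14, 14)           # pixel-level VL similarity map
    cls_prob = torch.sigmoid(sim.max())     # zero-shot classification score
    mask = sim_map > sim_map.mean()         # crude grounding mask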
♻ ☆ Generative Representational Learning of Foundation Models for Recommendation
Developing a single foundation model with the capability to excel across
diverse tasks has been a long-standing objective in the field of artificial
intelligence. As the wave of general-purpose foundation models sweeps across
various domains, their influence has significantly extended to the field of
recommendation systems. While recent efforts have explored recommendation
foundation models for various generative tasks, they often overlook crucial
embedding tasks and struggle with the complexities of multi-task learning,
including knowledge sharing & conflict resolution, and convergence speed
inconsistencies. To address these limitations, we introduce RecFound, a
generative representational learning framework for recommendation foundation
models. We construct the first comprehensive dataset for recommendation
foundation models covering both generative and embedding tasks across diverse
scenarios. Based on this dataset, we propose a novel multi-task training scheme
featuring a Task-wise Mixture of Low-rank Experts (TMoLE) to handle knowledge
sharing & conflict, a Step-wise Convergence-oriented Sample Scheduler (S2Sched)
to address inconsistent convergence, and a Model Merge module to balance the
performance across tasks. Experiments demonstrate that RecFound achieves
state-of-the-art performance across various recommendation tasks, outperforming
existing baselines.
comment: Project page is available at https://junkfood436.github.io/RecFound/
♻ ☆ Pipelined Decoder for Efficient Context-Aware Text Generation
As the basis of generative AI, an autoregressive model requires each new
token to be generated conditioned on all previously generated tokens, which
brings high quality but also restricts the model to generating tokens one by
one, forming a bottleneck that limits generation speed. In this paper, we
propose a new decoder architecture that efficiently generates text in parallel
for context-aware generation tasks. Our proposed pipelined decoder initiates
the generation of multiple subsequences simultaneously, and, at each time-step,
it generates a new token for each subsequence to realize parallelism.
Experiments on multiple text generation tasks, including question answering,
text summarization, and keyphrase generation, show that our pipelined decoder
significantly improves the generation speed without a significant loss of
generation quality or additional memory consumption.
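Schematically, the idea is to advance several subsequences by one token per step rather than one token for a single sequence; next_token below is a stub for what would be a single batched model call in practice.

    def next_token(context):
        # Stub for a model forward pass; a real decoder would batch the
        # k contexts and emit one new token for each subsequence per step.
        return f"tok{len(context)}"

    def pipelined_decode(prompt, k=3, steps=4):
        # Launch k subsequences at once (e.g., k keyphrases or k answer
        # segments) and grow all of them in lockstep.
        subsequences = [[prompt, f"<start_{i}>"] for i in range(k)]
        for _ in range(steps):
            for seq in subsequences:      # one batched call in practice
                seq.append(next_token(seq))
        return [" ".join(seq[1:]) for seq in subsequences]

    for s in pipelined_decode("Summarize: ..."):
        print(s)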
♻ ☆ Parameter-Efficient Fine-Tuning via Circular Convolution ACL 2025
Low-Rank Adaptation (LoRA) has gained popularity for fine-tuning large
foundation models, leveraging low-rank matrices $\mathbf{A}$ and $\mathbf{B}$
to represent weight changes (i.e., $\Delta \mathbf{W} = \mathbf{B}
\mathbf{A}$). This method reduces trainable parameters and mitigates heavy
memory consumption associated with full delta matrices by sequentially
multiplying $\mathbf{A}$ and $\mathbf{B}$ with the activation. Despite its
success, the intrinsic low-rank characteristic may limit its performance.
Although several variants have been proposed to address this issue, they often
overlook the crucial computational and memory efficiency brought by LoRA. In
this paper, we propose Circular Convolution Adaptation (C$^3$A), which not only
achieves high-rank adaptation with enhanced performance but also excels in both
computational power and memory utilization. Extensive experiments demonstrate
that C$^3$A consistently outperforms LoRA and its variants across various
fine-tuning tasks.
comment: ACL 2025
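A circular-convolution delta stores only a length-$d$ kernel yet is not rank-constrained, and by the convolution theorem it can be applied with FFTs in $O(d \log d)$. A minimal single-kernel sketch follows; the paper's actual design may differ (e.g., in blocking and initialization).

    import torch
    import torch.nn as nn

    class C3ALinear(nn.Module):
        """Frozen linear layer plus a circular-convolution delta,
        y = W x + kernel (*) x, computed with FFTs."""
        def __init__(self, base: nn.Linear):
            super().__init__()
            assert base.in_features == base.out_features  # square, for brevity
            self.base = base.requires_grad_(False)
            self.kernel = nn.Parameter(torch.zeros(base.in_features))

        def forward(self, x):
            # Convolution theorem: circ_conv(k, x) = irfft(rfft(k) * rfft(x)).
            delta = torch.fft.irfft(torch.fft.rfft(self.kernel)
                                    * torch.fft.rfft(x, dim=-1),
                                    n=x.shape[-1], dim=-1)
            return self.base(x) + delta

    layer = C3ALinear(nn.Linear(8, 8))
    out = layer(torch.randn(2, 8))  # only the 8-parameter kernel is trainable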
♻ ☆ Two-Stage Regularization-Based Structured Pruning for LLMs
Mingkuan Feng, Jinyang Wu, Siyuan Liu, Shuai Zhang, Ruihan Jin, Feihu Che, Pengpeng Shao, Zhengqi Wen, Jianhua Tao
The deployment of large language models (LLMs) is largely hindered by their
large number of parameters. Structural pruning has emerged as a promising
solution. Prior structured pruning methods directly remove unimportant
parameters based on certain metrics, which often causes knowledge loss and
necessitates extensive retraining. To overcome this, we introduce a novel
pruning method TRSP: Two-Stage Regularization-Based Structured Pruning for
LLMs. Specifically, we multiply the output of each transformer layer by an
initial learnable weight and iteratively learn these weights by adding their
$\ell_1$-norm as a regularization term to the loss function, serving as the
first-stage regularization. Subsequently, we apply additional regularization to
the difference between the output and input of layers with smaller weights,
encouraging the shift of knowledge to the preserved layers. This serves as the
second-stage regularization. TRSP retains more knowledge and better preserves
model performance than direct parameter elimination. Through extensive
experimentation we show that TRSP outperforms strong layer-wise structured
pruning methods without requiring retraining. As a layer-wise pruning method,
it delivers notable end-to-end acceleration, making it a promising solution for
efficient LLM deployment.
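The first-stage regularization can be sketched as a learnable per-layer output scale whose $\ell_1$-norm is added to the loss; the toy MLP and the value of lam below are placeholders for the transformer-layer setting in the paper.

    import torch
    import torch.nn as nn

    class ScaledLayer(nn.Module):
        """Wraps a layer and multiplies its output by a learnable weight w;
        a small |w| after training marks the layer as prunable."""
        def __init__(self, layer):
            super().__init__()
            self.layer = layer
            self.w = nn.Parameter(torch.ones(1))

        def forward(self, x):
            return self.w * self.layer(x)

    layers = nn.Sequential(*[ScaledLayer(nn.Linear(16, 16)) for _ in range(4)])
    x, target = torch.randn(8, 16), torch.randn(8, 16)

    lam = 1e-2
    task_loss = nn.functional.mse_loss(layers(x), target)
    l1 = sum(layer.w.abs().sum() for layer in layers)   # stage-1 regularizer
    loss = task_loss + lam * l1
    loss.backward()
    # Stage 2 (not shown) additionally penalizes ||output - input|| for the
    # layers with the smallest |w|, pushing their knowledge to kept layers.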
♻ ☆ Graft: Integrating the Domain Knowledge via Efficient Parameter Synergy for MLLMs
Yang Dai, Jianxiang An, Tianwei Lin, Hongyang He, Hongzhe Huang, Wenqiao Zhang, Zheqi Lv, Siliang Tang, Yueting Zhuang
Multimodal Large Language Models (MLLMs) have achieved success across various
domains. However, their applicability tends to degrade when confronted with
different types of data inputs, especially for MLLMs that have been fine-tuned
for specific tasks. Despite its importance, the study of knowledge sharing
among domain-specific MLLMs--such as those trained for mathematics or
code--remains largely underexplored. To address the fragmentation of knowledge
across domain-specialized MLLMs, we propose a unified parameter integration
framework that enables modular composition of expert capabilities. Our method
is grounded in a novel Compatibility-Aware Parameter Splicing (CAPS) strategy,
which leverages both local functional attribution and global
information-theoretic signals to guide selective parameter fusion. By extending
this mechanism to the low-rank adaptation layer granularity, we ensure
efficient integration with minimal inference overhead. Furthermore, we
introduce a domain compatibility scoring mechanism that quantifies inter-expert
alignment at the activation level and correlates with downstream task utility.
This principled fusion protocol allows the final model to synergize
heterogeneous expertise while preserving structural modularity. Extensive
evaluations across diverse multimodal benchmarks validate the effectiveness of
our framework, offering a scalable path toward compositional, domain-adaptive
MLLMs.
♻ ☆ BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference ICML 2025
The rapidly increasing size of large language models (LLMs) presents
significant challenges in memory usage and computational costs. Quantizing both
weights and activations can address these issues, with hardware-supported
fine-grained scaling emerging as a promising solution to mitigate outliers.
However, existing methods struggle to capture nuanced block data distributions.
We propose BlockDialect, a block-wise fine-grained mixed format technique that
assigns a per-block optimal number format from a formatbook for better data
representation. Additionally, we introduce DialectFP4, a formatbook of FP4
variants (akin to dialects) that adapt to diverse data distributions. To
leverage this efficiently, we propose a two-stage approach for online
DialectFP4 activation quantization. Importantly, DialectFP4 ensures energy
efficiency by selecting representable values as scaled integers compatible with
low-precision integer arithmetic. BlockDialect achieves 10.78% (7.48%) accuracy
gain on the LLaMA3-8B (LLaMA2-7B) model compared to MXFP4 format with lower bit
usage per data, while being only 5.45% (2.69%) below full precision even when
quantizing full-path matrix multiplication. Focusing on how to represent over
how to scale, our work presents a promising path for energy-efficient LLM
inference.
comment: ICML 2025
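Per-block format selection can be illustrated with a toy formatbook of two 4-bit value grids; each block is scaled and snapped to whichever grid reconstructs it best. The grids below are invented for illustration and are not the actual DialectFP4 dialects.

    import numpy as np

    # Two hypothetical 4-bit "dialects": one with finer resolution near zero,
    # one with more dynamic range (8 magnitudes + sign = 16 codes each).
    FORMATBOOK = {
        "near-zero": np.array([0., .5, 1., 1.5, 2., 2.5, 3., 4.]),
        "wide":      np.array([0., 1., 2., 3., 4., 6., 8., 12.]),
    }

    def quantize_block(block):
        best = None
        for name, grid in FORMATBOOK.items():
            scale = np.abs(block).max() / grid.max()    # per-block scaling
            levels = np.concatenate([-grid[::-1], grid]) * scale
            q = levels[np.abs(block[:, None] - levels[None, :]).argmin(axis=1)]
            err = np.mean((block - q) ** 2)
            if best is None or err < best[0]:
                best = (err, name, q)
        return best[1], best[2]  # chosen dialect and dequantized block

    block = np.random.default_rng(0).normal(size=32)
    fmt, deq = quantize_block(block)
    print(fmt, np.mean((block - deq) ** 2))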
♻ ☆ DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning ACL 2025
Previous multimodal sentence representation learning methods have achieved
impressive performance. However, most approaches focus on aligning images and
text at a coarse level, facing two critical challenges: cross-modal
misalignment bias and intra-modal semantic divergence, which significantly
degrade sentence
representation quality. To address these challenges, we propose DALR
(Dual-level Alignment Learning for Multimodal Sentence Representation). For
cross-modal alignment, we propose a consistency learning module that softens
negative samples and utilizes semantic similarity from an auxiliary task to
achieve fine-grained cross-modal alignment. Additionally, we contend that
sentence relationships go beyond binary positive-negative labels, exhibiting a
more intricate ranking structure. To better capture these relationships and
enhance representation quality, we integrate ranking distillation with global
intra-modal alignment learning. Comprehensive experiments on semantic textual
similarity (STS) and transfer (TR) tasks validate the effectiveness of our
approach, consistently demonstrating its superiority over state-of-the-art
baselines.
comment: Accepted by ACL 2025 Findings
♻ ☆ Flexora: Flexible Low Rank Adaptation for Large Language Models
Large Language Models (LLMs) are driving advancements in artificial
intelligence by increasing the scale of model parameters, which has
significantly enhanced generalization ability and unlocked new capabilities in
practice. However, their performance in specific downstream tasks is usually
hindered by their knowledge boundaries on these tasks. Thus, fine-tuning
techniques, especially the widely used Low-Rank Adaptation (LoRA) method, have
been introduced to expand these boundaries, whereas LoRA can underperform on
certain tasks owing to potential overfitting. To overcome this overfitting and
improve the performance of LoRA, we propose the flexible low rank adaptation
(Flexora) method to automatically and flexibly select the most important
layers to fine-tune for the best performance on different downstream tasks.
Specifically, Flexora first frames this layer selection problem as a
well-defined hyperparameter
optimization (HPO) problem, then addresses it using the unrolled
differentiation (UD) method, and finally selects the most useful layers based
on the optimized hyperparameters. Our extensive experiments on many pretrained
models and natural language tasks show that Flexora is able to consistently
improve over the existing baselines, indicating the effectiveness of our
Flexora in practice. We additionally provide insightful theoretical results and
many ablation studies to deliver a comprehensive understanding of our Flexora.
comment: 40 pages, 15 figures
♻ ☆ SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, Natasha Jaques
Recent advances in reinforcement learning have shown that language models can
develop sophisticated reasoning through training on tasks with verifiable
rewards, but these approaches depend on human-curated problem-answer pairs and
domain-specific reward engineering. We introduce SPIRAL, a self-play framework
where models learn by playing multi-turn, zero-sum games against continuously
improving versions of themselves, eliminating the need for human supervision.
Through self-play, SPIRAL generates an infinite curriculum of progressively
challenging problems as models must constantly adapt to stronger opponents. To
enable this self-play training at scale, we implement a fully online,
multi-turn, multi-agent reinforcement learning system for LLMs and propose
role-conditioned advantage estimation (RAE) to stabilize multi-agent training.
Using SPIRAL, self-play on zero-sum games produces reasoning capabilities that
transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone achieves 8.6%
improvement on math and 8.4% on general reasoning, outperforming SFT on 25,000
expert game trajectories. Analysis reveals that this transfer occurs through
three cognitive patterns: systematic decomposition, expected value calculation,
and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple
Negotiation) further enhances performance as each game develops distinct
reasoning strengths. Applying SPIRAL to a strong reasoning model
(DeepSeek-R1-Distill-Qwen-7B) can still lead to 2.0% average improvement. These
results demonstrate that zero-sum games naturally develop transferable
reasoning capabilities, highlighting a promising direction for autonomous
reasoning development.
comment: Work in Progress
♻ ☆ Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples
Recent advancements in audio-aware large language models (ALLMs) enable them
to process and understand audio inputs. However, these models often hallucinate
non-existent sound events, reducing their reliability in real-world
applications. To address this, we propose LISTEN (Learning to Identify Sounds
Through Extended Negative Samples), a contrastive-like training method that
enhances ALLMs' ability to distinguish between present and absent sounds using
synthesized data from the backbone LLM. Unlike prior approaches, our method
requires no modification to LLM parameters and efficiently integrates audio
representations via a lightweight adapter. Experiments show that LISTEN
effectively mitigates hallucinations while maintaining impressive performance
on existing audio question and reasoning benchmarks. At the same time, it is
more efficient in both data and computation.
comment: Accepted to Interspeech 2025. Project Website:
https://kuan2jiu99.github.io/Balsa
♻ ☆ Seeking and Updating with Live Visual Knowledge
Mingyang Fu, Yuyang Peng, Dongping Chen, Zetong Zhou, Benlin Liu, Yao Wan, Zhou Zhao, Philip S. Yu, Ranjay Krishna
The visual world around us constantly evolves, from real-time news and social
media trends to global infrastructure changes visible through satellite imagery
and augmented reality enhancements. However, Multimodal Large Language Models
(MLLMs), which automate many tasks, struggle to stay current, limited by the
cutoff dates in their fixed training datasets. To quantify this stagnation, we
introduce LiveVQA, the first-of-its-kind dataset featuring 107,143 samples
across 12 categories, specifically designed to support research in both seeking
and updating with live visual knowledge. Drawing from recent news articles,
video platforms, and academic publications in April 2024-May 2025, LiveVQA
enables evaluation of how models handle latest visual information beyond their
knowledge boundaries and how current methods help to update them. Our
comprehensive benchmarking of 17 state-of-the-art MLLMs reveals significant
performance gaps on content beyond the knowledge cutoff, while tool-use or
agentic visual-seeking frameworks yield an average improvement of 327%.
Furthermore, we explore parameter-efficient fine-tuning (PEFT) methods to
update MLLMs with new visual knowledge. We dive deeply to the critical balance
between adapter capacity and model capability when updating MLLMs with new
visual knowledge. All the experimental dataset and source code are publicly
available at: https://livevqa.github.io.
comment: Preprint. Under Review
♻ ☆ SPADE: Structured Prompting Augmentation for Dialogue Enhancement in Machine-Generated Text Detection ACL
The increasing capability of large language models (LLMs) to generate
synthetic content has heightened concerns about their misuse, driving the
development of Machine-Generated Text (MGT) detection models. However, these
detectors face significant challenges due to the lack of high-quality synthetic
datasets for training. To address this issue, we propose SPADE, a structured
framework for detecting synthetic dialogues using prompt-based positive and
negative samples. Our proposed methods yield 14 new dialogue datasets, which we
benchmark against eight MGT detection models. The results demonstrate improved
generalization performance when utilizing a mixed dataset produced by proposed
augmentation frameworks, offering a practical approach to enhancing LLM
application security. Considering that real-world agents lack knowledge of
future opponent utterances, we simulate online dialogue detection and examine
the relationship between chat history length and detection accuracy. Our
open-source datasets, code and prompts can be downloaded from
https://github.com/AngieYYF/SPADE-customer-service-dialogue.
comment: ACL LLMSEC