On-Policy Distillation

Useful Papers/Articles

- On-Policy Distillation (from Thinking Machines) (link)
- On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes (link)

Abstract

Training Stages

- Pre-training: teaches general capabilities such as language use, broad reasoning, and world knowledge.
- Mid-training: imparts domain knowledge, such as code, medical databases, or internal company documents.
- Post-training: elicits targeted behavior, such as instruction following, reasoning through math problems, or chat.

So it seems that injecting specialized knowledge is assigned to the mid-training stage? My sense is that if the pre-training corpus already has broad enough coverage, it effectively subsumes mid-training. But when facing a new domain, could I inject domain-specific knowledge with another round of mid-training after post-training? A few guesses:

- Mid-training does not require very clean text; fairly noisy data that carries the knowledge is enough. That is also why such training probably cannot be run directly after post-training: it would likely damage the model's instruction-following ability.
- I also suspect that if the post-training data volume is too large, some of the knowledge injected during mid-training may be forgotten, or at least become hard to elicit.
- So the solution should be either to build a high-quality, domain-knowledge post-training set, or to take a different route (knowledge distillation and the like).

On/Off-Policy KD

| KD Type | Mechanism | Training Signal |
| --- | --- | --- |
| Off-Policy KD | Teacher does, student learns | Teacher trajectories / logits |
| On-Policy KD (GKD) | Student does, teacher corrects | Teacher logits on student samples |

KL Divergences

Forward KL (mode-covering): $D_{KL}(P||Q_\theta)=\sum_{c\in C}P(c)\log\frac{P(c)}{Q_\theta(c)}$

Reverse KL (mode-seeking): $D_{KL}(Q_\theta||P)=\sum_{c\in C}Q_\theta(c)\log\frac{Q_\theta(c)}{P(c)}$

where $P(C)$ is the empirical (teacher/data) distribution and $Q_\theta(C)$ is the model distribution. ...
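To make the two KL directions and the on-policy recipe concrete, below is a minimal PyTorch-style sketch. It is my own illustration rather than code from either reference: `forward_kl` takes the expectation under the teacher $P$, `reverse_kl` under the student $Q_\theta$, and `on_policy_distill_step` assumes hypothetical `student.generate(...)` and `model(ids).logits` interfaces for a GKD-style update on the student's self-generated tokens.

```python
import torch
import torch.nn.functional as F

def forward_kl(teacher_logits, student_logits):
    """Mode-covering D_KL(P || Q_theta): expectation taken under the teacher P."""
    p = F.softmax(teacher_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)
    return (p * (log_p - log_q)).sum(dim=-1).mean()

def reverse_kl(teacher_logits, student_logits):
    """Mode-seeking D_KL(Q_theta || P): expectation taken under the student Q_theta."""
    q = F.softmax(student_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    return (q * (log_q - log_p)).sum(dim=-1).mean()

def on_policy_distill_step(student, teacher, prompts, optimizer):
    """One GKD-style step: the student generates, the teacher corrects.

    `student.generate` and `model(ids).logits` are assumed interfaces for
    illustration; a real setup also needs attention masks, sampling
    arguments, and masking of the prompt tokens in the loss.
    """
    with torch.no_grad():
        rollout = student.generate(prompts)        # student's own trajectories
        teacher_logits = teacher(rollout).logits   # teacher scores those same tokens
    student_logits = student(rollout).logits
    loss = reverse_kl(teacher_logits, student_logits)  # mode-seeking objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

The reverse-KL choice is what makes the student mode-seeking: it is penalized for placing mass where the teacher would not, rather than being forced to cover every teacher mode.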

October 2025 · Yifei

Ranking

Ranking Methods: Pointwise, Listwise, Pairwise, Setwise

Pointwise

The most common kind of automatic metric. LLM-as-a-Judge: input a single text, output a score.

- Relevance generation
- Query generation

Pros: token-efficient and scalable.
Cons: struggles to capture comparative information across candidates.

Listwise

Feed multiple candidates into one prompt and let the LLM rank them all at once.

Pros:
- A natural fit for the current trend of growing LLM context window sizes.
- The self-attention mechanism can capture relationships among the candidates more effectively.

Cons:
- Token-consuming (especially for long document/text tasks).
- Positional bias!! (Anyone working on this topic has to address and discuss it.)

Related Work: RankGPT, LRL, TourRank, ListT5

Pairwise

The definition sometimes feels a bit vague. Explicit pairwise should mean the input is (Candidate A, Candidate B) and the output is a score comparing the two. But sometimes we can also train a model in a pairwise fashion indirectly, e.g., with RankNet, where the inputs and outputs still look pointwise.

For LLM-as-a-Judge, though, pairwise should mean the former: inject two candidates into a single inference prompt and let the LLM judge.

This currently looks like a solid approach, and much research focuses on turning pairwise results into a listwise ranking; related work includes PRP-Graph (2024) and REALM (2025). There is also the very direct multi-round bubble sort (sketched after this excerpt).

Setwise

Another interesting approach (also sketched below): first split the candidate list into subsets, select the best top-k from each subset, then merge the survivors and select the top-k again.

This effectively mitigates the long-sequence problem of listwise ranking and looks more efficient than pairwise. ...
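The two sketches below are my own illustrations of the procedures described above, not code from the cited papers. `prefer_first` and `judge_top_k` are hypothetical wrappers around an LLM-as-a-Judge prompt: the first answers a pairwise comparison, the second returns the top-k of a small candidate set; everything else is plain bookkeeping.

```python
from typing import Callable, List

def pairwise_bubble_top_k(
    query: str,
    candidates: List[str],
    prefer_first: Callable[[str, str, str], bool],
    k: int = 3,
) -> List[str]:
    """Multi-round bubble sort with a pairwise LLM judge.

    `prefer_first(query, a, b)` is a hypothetical pairwise-prompt wrapper
    returning True if `a` should rank above `b`. Each pass bubbles the
    current best candidate to the front, so k passes surface the top-k.
    """
    ranked = list(candidates)
    for i in range(min(k, len(ranked))):
        for j in range(len(ranked) - 1, i, -1):
            if prefer_first(query, ranked[j], ranked[j - 1]):
                ranked[j], ranked[j - 1] = ranked[j - 1], ranked[j]
    return ranked[:k]

def setwise_top_k(
    query: str,
    candidates: List[str],
    judge_top_k: Callable[[str, List[str], int], List[str]],
    subset_size: int = 10,
    k: int = 3,
) -> List[str]:
    """Setwise selection: split, keep each subset's top-k, merge, re-select.

    `judge_top_k(query, subset, k)` is a hypothetical setwise-prompt wrapper
    returning the judged top-k of a small candidate set. Assumes k < subset_size
    so the candidate pool strictly shrinks each level.
    """
    survivors: List[str] = []
    for i in range(0, len(candidates), subset_size):
        subset = candidates[i : i + subset_size]
        survivors.extend(judge_top_k(query, subset, min(k, len(subset))))
    if len(survivors) > subset_size:
        # Still too many to fit one prompt comfortably: recurse on the survivors.
        return setwise_top_k(query, survivors, judge_top_k, subset_size, k)
    return judge_top_k(query, survivors, min(k, len(survivors)))
```

The bubble-sort variant needs roughly k·n pairwise calls to surface the top-k, while the setwise variant needs roughly n/subset_size prompts per level plus a final merge call, which is where its efficiency advantage over pairwise comes from.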

August 2025 · Yifei