On-Policy Distillation

Useful Papers/Articles

- On-Policy Distillation (from Thinking Machines) (link)
- On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes (link)

Abstract

Training Stages

- Pre-training: teaches general capabilities such as language use, broad reasoning, and world knowledge.
- Mid-training: imparts domain knowledge, such as code, medical databases, or internal company documents.
- Post-training: elicits targeted behavior, such as instruction following, reasoning through math problems, or chat.

So it seems that injecting specialized knowledge is assigned to the mid-training stage? My sense is that if the pre-training corpus already has broad enough coverage, it effectively subsumes mid-training. But when facing a new domain, could I inject domain-specific knowledge with another round of mid-training after post-training? A few guesses:

- Mid-training does not require very clean text; fairly noisy data that carries the knowledge is enough. That is also why such training probably cannot be run directly after post-training: it would likely damage the model's instruction-following ability.
- I also suspect that if the post-training data volume is too large, some of the knowledge injected during mid-training may be forgotten, or at least become hard to elicit.
- So the solution should be either to build a high-quality, domain-knowledge post-training set, or to take a different route (knowledge distillation and the like).

On/Off-Policy KD

| KD Type | Mechanism | Training Signal |
| --- | --- | --- |
| Off-Policy KD | Teacher does, student learns | Teacher trajectories / logits |
| On-Policy KD (GKD) | Student does, teacher corrects | Teacher logits on student samples |

KL Divergences

Forward KL (mode-covering): $D_{KL}(P||Q_\theta)=\sum_{c\in C}P(c)\log\frac{P(c)}{Q_\theta(c)}$

Reverse KL (mode-seeking): $D_{KL}(Q_\theta||P)=\sum_{c\in C}Q_\theta(c)\log\frac{Q_\theta(c)}{P(c)}$

where $P(C)$ is the empirical (teacher/data) distribution and $Q_\theta(C)$ is the model distribution. ...
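To make the two KL directions and the on-policy recipe concrete, below is a minimal PyTorch-style sketch. It is my own illustration rather than code from either reference: `forward_kl` takes the expectation under the teacher $P$, `reverse_kl` under the student $Q_\theta$, and `on_policy_distill_step` assumes hypothetical `student.generate(...)` and `model(ids).logits` interfaces for a GKD-style update on the student's self-generated tokens.

```python
import torch
import torch.nn.functional as F

def forward_kl(teacher_logits, student_logits):
    """Mode-covering D_KL(P || Q_theta): expectation taken under the teacher P."""
    p = F.softmax(teacher_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)
    return (p * (log_p - log_q)).sum(dim=-1).mean()

def reverse_kl(teacher_logits, student_logits):
    """Mode-seeking D_KL(Q_theta || P): expectation taken under the student Q_theta."""
    q = F.softmax(student_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    return (q * (log_q - log_p)).sum(dim=-1).mean()

def on_policy_distill_step(student, teacher, prompts, optimizer):
    """One GKD-style step: the student generates, the teacher corrects.

    `student.generate` and `model(ids).logits` are assumed interfaces for
    illustration; a real setup also needs attention masks, sampling
    arguments, and masking of the prompt tokens in the loss.
    """
    with torch.no_grad():
        rollout = student.generate(prompts)        # student's own trajectories
        teacher_logits = teacher(rollout).logits   # teacher scores those same tokens
    student_logits = student(rollout).logits
    loss = reverse_kl(teacher_logits, student_logits)  # mode-seeking objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

The reverse-KL choice is what makes the student mode-seeking: it is penalized for placing mass where the teacher would not, rather than being forced to cover every teacher mode.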

October 2025 · Yifei

Ranking

Ranking Methods: Pointwise, Listwise, Pairwise, Setwise

Pointwise

The most common kind of automatic metric. LLM-as-a-Judge: input a single text, output a score.

- Relevance generation
- Query generation

Pros: token-efficient and scalable.
Cons: struggles to capture comparative information across candidates.

Listwise

Feed multiple candidates into one prompt and let the LLM rank them all at once.

Pros:
- A natural fit for the current trend of growing LLM context window sizes.
- The self-attention mechanism can capture relationships among the candidates more effectively.

Cons:
- Token-consuming (especially for long document/text tasks).
- Positional bias!! (Anyone working on this topic has to address and discuss it.)

Related Work: RankGPT, LRL, TourRank, ListT5

Pairwise

The definition sometimes feels a bit vague. Explicit pairwise should mean the input is (Candidate A, Candidate B) and the output is a score comparing the two. But sometimes we can also train a model in a pairwise fashion indirectly, e.g., with RankNet, where the inputs and outputs still look pointwise.

For LLM-as-a-Judge, though, pairwise should mean the former: inject two candidates into a single inference prompt and let the LLM judge.

This currently looks like a solid approach, and much research focuses on turning pairwise results into a listwise ranking; related work includes PRP-Graph (2024) and REALM (2025). There is also the very direct multi-round bubble sort (sketched after this excerpt).

Setwise

Another interesting approach (also sketched below): first split the candidate list into subsets, select the best top-k from each subset, then merge the survivors and select the top-k again.

This effectively mitigates the long-sequence problem of listwise ranking and looks more efficient than pairwise. ...
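The two sketches below are my own illustrations of the procedures described above, not code from the cited papers. `prefer_first` and `judge_top_k` are hypothetical wrappers around an LLM-as-a-Judge prompt: the first answers a pairwise comparison, the second returns the top-k of a small candidate set; everything else is plain bookkeeping.

```python
from typing import Callable, List

def pairwise_bubble_top_k(
    query: str,
    candidates: List[str],
    prefer_first: Callable[[str, str, str], bool],
    k: int = 3,
) -> List[str]:
    """Multi-round bubble sort with a pairwise LLM judge.

    `prefer_first(query, a, b)` is a hypothetical pairwise-prompt wrapper
    returning True if `a` should rank above `b`. Each pass bubbles the
    current best candidate to the front, so k passes surface the top-k.
    """
    ranked = list(candidates)
    for i in range(min(k, len(ranked))):
        for j in range(len(ranked) - 1, i, -1):
            if prefer_first(query, ranked[j], ranked[j - 1]):
                ranked[j], ranked[j - 1] = ranked[j - 1], ranked[j]
    return ranked[:k]

def setwise_top_k(
    query: str,
    candidates: List[str],
    judge_top_k: Callable[[str, List[str], int], List[str]],
    subset_size: int = 10,
    k: int = 3,
) -> List[str]:
    """Setwise selection: split, keep each subset's top-k, merge, re-select.

    `judge_top_k(query, subset, k)` is a hypothetical setwise-prompt wrapper
    returning the judged top-k of a small candidate set. Assumes k < subset_size
    so the candidate pool strictly shrinks each level.
    """
    survivors: List[str] = []
    for i in range(0, len(candidates), subset_size):
        subset = candidates[i : i + subset_size]
        survivors.extend(judge_top_k(query, subset, min(k, len(subset))))
    if len(survivors) > subset_size:
        # Still too many to fit one prompt comfortably: recurse on the survivors.
        return setwise_top_k(query, survivors, judge_top_k, subset_size, k)
    return judge_top_k(query, survivors, min(k, len(survivors)))
```

The bubble-sort variant needs roughly k·n pairwise calls to surface the top-k, while the setwise variant needs roughly n/subset_size prompts per level plus a final merge call, which is where its efficiency advantage over pairwise comes from.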

August 2025 · Yifei